* [PATCH v4 00/57] tcg: Improve atomicity support
@ 2023-05-03  7:05 Richard Henderson
  2023-05-03  7:06 ` [PATCH v4 01/57] include/exec/memop: Add bits describing atomicity Richard Henderson
                   ` (57 more replies)
  0 siblings, 58 replies; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:05 UTC (permalink / raw)
  To: qemu-devel; +Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

v1: https://lore.kernel.org/qemu-devel/20221118094754.242910-1-richard.henderson@linaro.org/
v2: https://lore.kernel.org/qemu-devel/20230216025739.1211680-1-richard.henderson@linaro.org/
v3: https://lore.kernel.org/qemu-devel/20230425193146.2106111-1-richard.henderson@linaro.org/

Based-on: 20230503065729.1745843-1-richard.henderson@linaro.org
("[PATCH v4 00/54] tcg: Simplify calls to load/store helpers")

The main objective here is to support Arm FEAT_LSE2, which says that any
single memory access that does not cross a 16-byte boundary is atomic.
This is the MO_ATOM_WITHIN16 control.

While I'm touching all of this, a secondary objective is to handle the
atomicity of the IBM machines.  Both Power and s390x treat misaligned
accesses as atomic in units determined by the low bits of the address.
For instance, an 8-byte access at ptr % 8 == 4 will appear as two atomic
4-byte accesses, and one at ptr % 4 == 2 as four atomic 2-byte accesses.
This is the MO_ATOM_SUBALIGN control.
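
For a concrete picture (illustration only, not part of any patch), the
SUBALIGN unit is simply the largest power of two dividing both the
access size and the address; a minimal standalone sketch:

  #include <stdint.h>
  #include <stdio.h>

  /* Sketch of MO_ATOM_SUBALIGN: the atomic unit is the largest power
   * of two that divides both the access size and the address. */
  static unsigned subalign_unit(uintptr_t addr, unsigned size)
  {
      unsigned unit = size;              /* size is a power of two */
      while (addr & (unit - 1)) {
          unit >>= 1;
      }
      return unit;
  }

  int main(void)
  {
      printf("%u\n", subalign_unit(4, 8));   /* 4: two 4-byte pieces */
      printf("%u\n", subalign_unit(6, 8));   /* 2: four 2-byte pieces */
      return 0;
  }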

By default, accesses are atomic only if aligned, which is the current
behaviour of the tcg code generator (mostly, anyway; there were bugs).
This is the MO_ATOM_IFALIGN control.

Further, one can say that a large memory access is really a set of
contiguous smaller accesses, and we need not provide more atomicity
than that (modulo MO_ATOM_WITHIN16).  This is the MO_ATMAX_* control.

Changes for v4:
  - Rebase, fixing some conflicts.


r~


Richard Henderson (57):
  include/exec/memop: Add bits describing atomicity
  accel/tcg: Add cpu_in_serial_context
  accel/tcg: Introduce tlb_read_idx
  accel/tcg: Reorg system mode load helpers
  accel/tcg: Reorg system mode store helpers
  accel/tcg: Honor atomicity of loads
  accel/tcg: Honor atomicity of stores
  target/loongarch: Do not include tcg-ldst.h
  tcg: Unify helper_{be,le}_{ld,st}*
  accel/tcg: Implement helper_{ld,st}*_mmu for user-only
  tcg/tci: Use helper_{ld,st}*_mmu for user-only
  tcg: Add 128-bit guest memory primitives
  meson: Detect atomic128 support with optimization
  tcg/i386: Add have_atomic16
  accel/tcg: Use have_atomic16 in ldst_atomicity.c.inc
  accel/tcg: Add aarch64 specific support in ldst_atomicity
  tcg/aarch64: Detect have_lse, have_lse2 for linux
  tcg/aarch64: Detect have_lse, have_lse2 for darwin
  accel/tcg: Add have_lse2 support in ldst_atomicity
  tcg: Introduce TCG_OPF_TYPE_MASK
  tcg/i386: Use full load/store helpers in user-only mode
  tcg/aarch64: Use full load/store helpers in user-only mode
  tcg/ppc: Use full load/store helpers in user-only mode
  tcg/loongarch64: Use full load/store helpers in user-only mode
  tcg/riscv: Use full load/store helpers in user-only mode
  tcg/arm: Adjust constraints on qemu_ld/st
  tcg/arm: Use full load/store helpers in user-only mode
  tcg/mips: Use full load/store helpers in user-only mode
  tcg/s390x: Use full load/store helpers in user-only mode
  tcg/sparc64: Allocate %g2 as a third temporary
  tcg/sparc64: Rename tcg_out_movi_imm13 to tcg_out_movi_s13
  tcg/sparc64: Rename tcg_out_movi_imm32 to tcg_out_movi_u32
  tcg/sparc64: Split out tcg_out_movi_s32
  tcg/sparc64: Use standard slow path for softmmu
  accel/tcg: Remove helper_unaligned_{ld,st}
  tcg/loongarch64: Assert the host supports unaligned accesses
  tcg/loongarch64: Support softmmu unaligned accesses
  tcg/riscv: Support softmmu unaligned accesses
  tcg: Introduce tcg_target_has_memory_bswap
  tcg: Add INDEX_op_qemu_{ld,st}_i128
  tcg: Support TCG_TYPE_I128 in tcg_out_{ld,st}_helper_{args,ret}
  tcg: Introduce atom_and_align_for_opc
  tcg/i386: Use atom_and_align_for_opc
  tcg/aarch64: Use atom_and_align_for_opc
  tcg/arm: Use atom_and_align_for_opc
  tcg/loongarch64: Use atom_and_align_for_opc
  tcg/mips: Use atom_and_align_for_opc
  tcg/ppc: Use atom_and_align_for_opc
  tcg/riscv: Use atom_and_align_for_opc
  tcg/s390x: Use atom_and_align_for_opc
  tcg/sparc64: Use atom_and_align_for_opc
  tcg/i386: Honor 64-bit atomicity in 32-bit mode
  tcg/i386: Support 128-bit load/store with have_atomic16
  tcg/aarch64: Rename temporaries
  tcg/aarch64: Support 128-bit load/store
  tcg/ppc: Support 128-bit load/store
  tcg/s390x: Support 128-bit load/store

 accel/tcg/internal.h             |    5 +
 accel/tcg/tcg-runtime.h          |    3 +
 include/exec/cpu-defs.h          |    7 +-
 include/exec/cpu_ldst.h          |   26 +-
 include/exec/memop.h             |   36 +
 include/qemu/cpuid.h             |   18 +
 include/tcg/tcg-ldst.h           |   72 +-
 include/tcg/tcg-opc.h            |    8 +
 include/tcg/tcg.h                |   22 +-
 tcg/aarch64/tcg-target-con-set.h |    2 +
 tcg/aarch64/tcg-target.h         |    6 +-
 tcg/arm/tcg-target-con-set.h     |   16 +-
 tcg/arm/tcg-target-con-str.h     |    5 +-
 tcg/arm/tcg-target.h             |    3 +-
 tcg/i386/tcg-target.h            |    7 +-
 tcg/loongarch64/tcg-target.h     |    3 +-
 tcg/mips/tcg-target.h            |    4 +-
 tcg/ppc/tcg-target-con-set.h     |    2 +
 tcg/ppc/tcg-target-con-str.h     |    1 +
 tcg/ppc/tcg-target.h             |    4 +-
 tcg/riscv/tcg-target.h           |    4 +-
 tcg/s390x/tcg-target-con-set.h   |    2 +
 tcg/s390x/tcg-target.h           |    4 +-
 tcg/sparc64/tcg-target-con-set.h |    2 -
 tcg/sparc64/tcg-target-con-str.h |    1 -
 tcg/sparc64/tcg-target.h         |    4 +-
 tcg/tcg-internal.h               |    2 +
 tcg/tci/tcg-target.h             |    4 +-
 accel/tcg/cpu-exec-common.c      |    3 +
 accel/tcg/cputlb.c               | 1799 ++++++++++++++++++------------
 accel/tcg/tb-maint.c             |    2 +-
 accel/tcg/user-exec.c            |  489 +++++---
 target/loongarch/csr_helper.c    |    1 -
 target/loongarch/iocsr_helper.c  |    1 -
 tcg/optimize.c                   |   15 +-
 tcg/tcg-op.c                     |  265 +++--
 tcg/tcg.c                        |  270 ++++-
 tcg/tci.c                        |  150 +--
 accel/tcg/ldst_atomicity.c.inc   | 1373 +++++++++++++++++++++++
 docs/devel/loads-stores.rst      |   36 +-
 docs/devel/tcg-ops.rst           |   11 +-
 meson.build                      |   52 +-
 tcg/aarch64/tcg-target.c.inc     |  384 +++++--
 tcg/arm/tcg-target.c.inc         |  121 +-
 tcg/i386/tcg-target.c.inc        |  366 ++++--
 tcg/loongarch64/tcg-target.c.inc |   91 +-
 tcg/mips/tcg-target.c.inc        |  104 +-
 tcg/ppc/tcg-target.c.inc         |  261 +++--
 tcg/riscv/tcg-target.c.inc       |  132 +--
 tcg/s390x/tcg-target.c.inc       |  177 ++-
 tcg/sparc64/tcg-target.c.inc     |  714 ++++--------
 tcg/tci/tcg-target.c.inc         |    8 +-
 52 files changed, 4663 insertions(+), 2435 deletions(-)
 create mode 100644 accel/tcg/ldst_atomicity.c.inc

-- 
2.34.1




* [PATCH v4 01/57] include/exec/memop: Add bits describing atomicity
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-04 14:49   ` Peter Maydell
  2023-05-03  7:06 ` [PATCH v4 02/57] accel/tcg: Add cpu_in_serial_context Richard Henderson
                   ` (56 subsequent siblings)
  57 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel
  Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x, Alex Bennée

These bits describe the precise atomicity requirements of the guest,
which can then be used to constrain the methods by which the host
emulates the access.

For instance, the AArch64 LDP (32-bit) instruction changes
semantics with ARMv8.4 LSE2, from

  MO_64 | MO_ATMAX_4 | MO_ATOM_IFALIGN
  (64-bits, single-copy atomic only on 4 byte units,
   nonatomic if not aligned by 4),

to

  MO_64 | MO_ATMAX_SIZE | MO_ATOM_WITHIN16
  (64-bits, single-copy atomic within a 16 byte block)

The former may be implemented with two 4 byte loads, or
a single 8 byte load if that happens to be efficient on
the host.  The latter may not, and may also require a
helper when misaligned.
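
As a sketch of how these bits compose and decompose (illustration only;
the constants below mirror the values added to memop.h by this patch,
plus the pre-existing MO_64 size encoding):

  #include <stdio.h>

  enum {                                   /* mirrored from memop.h */
      MO_64            = 3,
      MO_ATOM_SHIFT    = 8,
      MO_ATOM_IFALIGN  = 0 << MO_ATOM_SHIFT,
      MO_ATOM_WITHIN16 = 3 << MO_ATOM_SHIFT,
      MO_ATMAX_SHIFT   = 10,
      MO_ATMAX_MASK    = 0x3 << MO_ATMAX_SHIFT,
      MO_ATMAX_SIZE    = 0 << MO_ATMAX_SHIFT,
      MO_ATMAX_4       = 2 << MO_ATMAX_SHIFT,
  };

  /* Largest unit, in bytes, that must be single-copy atomic for @op. */
  static int atmax_bytes(int op, int size_bytes)
  {
      if ((op & MO_ATMAX_MASK) == MO_ATMAX_SIZE) {
          return size_bytes;
      }
      return 1 << ((op & MO_ATMAX_MASK) >> MO_ATMAX_SHIFT);
  }

  int main(void)
  {
      int pre  = MO_64 | MO_ATMAX_4 | MO_ATOM_IFALIGN;      /* LDP today */
      int post = MO_64 | MO_ATMAX_SIZE | MO_ATOM_WITHIN16;  /* LDP w/ LSE2 */

      printf("%d\n", atmax_bytes(pre, 8));   /* 4-byte atomic units */
      printf("%d\n", atmax_bytes(post, 8));  /* 8 bytes, the full size */
      return 0;
  }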

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 include/exec/memop.h | 36 ++++++++++++++++++++++++++++++++++++
 1 file changed, 36 insertions(+)

diff --git a/include/exec/memop.h b/include/exec/memop.h
index 25d027434a..04e4048f0b 100644
--- a/include/exec/memop.h
+++ b/include/exec/memop.h
@@ -81,6 +81,42 @@ typedef enum MemOp {
     MO_ALIGN_32 = 5 << MO_ASHIFT,
     MO_ALIGN_64 = 6 << MO_ASHIFT,
 
+    /*
+     * MO_ATOM_* describes the atomicity requirements of the operation:
+     * MO_ATOM_IFALIGN: the operation must be single-copy atomic if and
+     *    only if it is aligned; if unaligned there is no atomicity.
+     * MO_ATOM_NONE: the operation has no atomicity requirements.
+     * MO_ATOM_SUBALIGN: the operation is single-copy atomic in parts
+     *    determined by the alignment.  E.g. if the address is 0 mod 4,
+     *    then each 4-byte subobject is single-copy atomic.
+     *    This is the atomicity of IBM Power and S390X processors.
+     * MO_ATOM_WITHIN16: the operation is single-copy atomic, even if it
+     *    is unaligned, so long as it does not cross a 16-byte boundary;
+     *    if it crosses a 16-byte boundary there is no atomicity.
+     *    This is the atomicity of Arm FEAT_LSE2.
+     *
+     * MO_ATMAX_* describes the maximum atomicity unit required:
+     * MO_ATMAX_SIZE: the entire operation, i.e. MO_SIZE.
+     * MO_ATMAX_[248]: units of N bytes.
+     *
+     * Note the default (i.e. 0) values are single-copy atomic to the
+     * size of the operation, if aligned.  This retains the behaviour
+     * from before these were introduced.
+     */
+    MO_ATOM_SHIFT    = 8,
+    MO_ATOM_MASK     = 0x3 << MO_ATOM_SHIFT,
+    MO_ATOM_IFALIGN  = 0 << MO_ATOM_SHIFT,
+    MO_ATOM_NONE     = 1 << MO_ATOM_SHIFT,
+    MO_ATOM_SUBALIGN = 2 << MO_ATOM_SHIFT,
+    MO_ATOM_WITHIN16 = 3 << MO_ATOM_SHIFT,
+
+    MO_ATMAX_SHIFT = 10,
+    MO_ATMAX_MASK  = 0x3 << MO_ATMAX_SHIFT,
+    MO_ATMAX_SIZE  = 0 << MO_ATMAX_SHIFT,
+    MO_ATMAX_2     = 1 << MO_ATMAX_SHIFT,
+    MO_ATMAX_4     = 2 << MO_ATMAX_SHIFT,
+    MO_ATMAX_8     = 3 << MO_ATMAX_SHIFT,
+
     /* Combinations of the above, for ease of use.  */
     MO_UB    = MO_8,
     MO_UW    = MO_16,
-- 
2.34.1




* [PATCH v4 02/57] accel/tcg: Add cpu_in_serial_context
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
  2023-05-03  7:06 ` [PATCH v4 01/57] include/exec/memop: Add bits describing atomicity Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-03  7:06 ` [PATCH v4 03/57] accel/tcg: Introduce tlb_read_idx Richard Henderson
                   ` (55 subsequent siblings)
  57 siblings, 0 replies; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel
  Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x, Alex Bennée

Like cpu_in_exclusive_context, but also true if
there is no other cpu against which we could race.

Use it in tb_flush as a direct replacement.
Use it in cpu_loop_exit_atomic to ensure that there
is no loop against cpu_exec_step_atomic.
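
As a truth table (illustration only, not part of the patch), the
predicate is "no other vCPU can race with us":

  #include <stdbool.h>
  #include <stdio.h>

  /* Sketch: serial when we never run in parallel (CF_PARALLEL clear)
   * or while holding the exclusive section. */
  static bool serial(bool cf_parallel, bool in_exclusive)
  {
      return !cf_parallel || in_exclusive;
  }

  int main(void)
  {
      printf("%d\n", serial(false, false)); /* single-threaded TCG: 1 */
      printf("%d\n", serial(true, false));  /* MTTCG, normal run:   0 */
      printf("%d\n", serial(true, true));   /* MTTCG, exclusive:    1 */
      return 0;
  }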

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 accel/tcg/internal.h        | 5 +++++
 accel/tcg/cpu-exec-common.c | 3 +++
 accel/tcg/tb-maint.c        | 2 +-
 3 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/accel/tcg/internal.h b/accel/tcg/internal.h
index 7bb0fdbe14..8ca24420ea 100644
--- a/accel/tcg/internal.h
+++ b/accel/tcg/internal.h
@@ -64,6 +64,11 @@ static inline target_ulong log_pc(CPUState *cpu, const TranslationBlock *tb)
     }
 }
 
+static inline bool cpu_in_serial_context(CPUState *cs)
+{
+    return !(cs->tcg_cflags & CF_PARALLEL) || cpu_in_exclusive_context(cs);
+}
+
 extern int64_t max_delay;
 extern int64_t max_advance;
 
diff --git a/accel/tcg/cpu-exec-common.c b/accel/tcg/cpu-exec-common.c
index e7962c9348..9a5fabf625 100644
--- a/accel/tcg/cpu-exec-common.c
+++ b/accel/tcg/cpu-exec-common.c
@@ -22,6 +22,7 @@
 #include "sysemu/tcg.h"
 #include "exec/exec-all.h"
 #include "qemu/plugin.h"
+#include "internal.h"
 
 bool tcg_allowed;
 
@@ -81,6 +82,8 @@ void cpu_loop_exit_restore(CPUState *cpu, uintptr_t pc)
 
 void cpu_loop_exit_atomic(CPUState *cpu, uintptr_t pc)
 {
+    /* Prevent looping if already executing in a serial context. */
+    g_assert(!cpu_in_serial_context(cpu));
     cpu->exception_index = EXCP_ATOMIC;
     cpu_loop_exit_restore(cpu, pc);
 }
diff --git a/accel/tcg/tb-maint.c b/accel/tcg/tb-maint.c
index cb1f806f00..7d613d36d2 100644
--- a/accel/tcg/tb-maint.c
+++ b/accel/tcg/tb-maint.c
@@ -760,7 +760,7 @@ void tb_flush(CPUState *cpu)
     if (tcg_enabled()) {
         unsigned tb_flush_count = qatomic_mb_read(&tb_ctx.tb_flush_count);
 
-        if (cpu_in_exclusive_context(cpu)) {
+        if (cpu_in_serial_context(cpu)) {
             do_tb_flush(cpu, RUN_ON_CPU_HOST_INT(tb_flush_count));
         } else {
             async_safe_run_on_cpu(cpu, do_tb_flush,
-- 
2.34.1




* [PATCH v4 03/57] accel/tcg: Introduce tlb_read_idx
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
  2023-05-03  7:06 ` [PATCH v4 01/57] include/exec/memop: Add bits describing atomicity Richard Henderson
  2023-05-03  7:06 ` [PATCH v4 02/57] accel/tcg: Add cpu_in_serial_context Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-04 15:02   ` Peter Maydell
  2023-05-03  7:06 ` [PATCH v4 04/57] accel/tcg: Reorg system mode load helpers Richard Henderson
                   ` (54 subsequent siblings)
  57 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel
  Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x, Alex Bennée

Instead of playing with offsetof in various places, use
MMUAccessType to index an array.  This array is easily defined
in place of the previous dummy padding array in the union.
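
A standalone sketch of the indexing trick (illustration only; the
structure and names are simplified from the real CPUTLBEntry):

  #include <assert.h>
  #include <stddef.h>
  #include <stdio.h>

  enum { IDX_LOAD, IDX_STORE, IDX_FETCH };  /* stand-in for MMUAccessType */

  typedef union {
      struct {
          unsigned long addr_read;
          unsigned long addr_write;
          unsigned long addr_code;
      };
      unsigned long addr_idx[3];            /* index access to the above */
  } Entry;

  int main(void)
  {
      /* The layout assumptions the patch checks with QEMU_BUILD_BUG_ON. */
      static_assert(offsetof(Entry, addr_write)
                    == IDX_STORE * sizeof(unsigned long), "layout");
      static_assert(offsetof(Entry, addr_code)
                    == IDX_FETCH * sizeof(unsigned long), "layout");

      Entry e = { .addr_write = 0x1234 };
      printf("0x%lx\n", e.addr_idx[IDX_STORE]);   /* 0x1234 */
      return 0;
  }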

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 include/exec/cpu-defs.h |   7 ++-
 include/exec/cpu_ldst.h |  26 ++++++++--
 accel/tcg/cputlb.c      | 104 +++++++++++++---------------------------
 3 files changed, 59 insertions(+), 78 deletions(-)

diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
index e1c498ef4b..a6e0cf1812 100644
--- a/include/exec/cpu-defs.h
+++ b/include/exec/cpu-defs.h
@@ -111,8 +111,11 @@ typedef struct CPUTLBEntry {
                use the corresponding iotlb value.  */
             uintptr_t addend;
         };
-        /* padding to get a power of two size */
-        uint8_t dummy[1 << CPU_TLB_ENTRY_BITS];
+        /*
+         * Padding to get a power of two size, as well as index
+         * access to addr_{read,write,code}.
+         */
+        target_ulong addr_idx[(1 << CPU_TLB_ENTRY_BITS) / TARGET_LONG_SIZE];
     };
 } CPUTLBEntry;
 
diff --git a/include/exec/cpu_ldst.h b/include/exec/cpu_ldst.h
index c141f0394f..7c867c94c3 100644
--- a/include/exec/cpu_ldst.h
+++ b/include/exec/cpu_ldst.h
@@ -360,13 +360,29 @@ static inline void clear_helper_retaddr(void)
 /* Needed for TCG_OVERSIZED_GUEST */
 #include "tcg/tcg.h"
 
+static inline target_ulong tlb_read_idx(const CPUTLBEntry *entry,
+                                        MMUAccessType access_type)
+{
+    /* Do not rearrange the CPUTLBEntry structure members. */
+    QEMU_BUILD_BUG_ON(offsetof(CPUTLBEntry, addr_read) !=
+                      MMU_DATA_LOAD * TARGET_LONG_SIZE);
+    QEMU_BUILD_BUG_ON(offsetof(CPUTLBEntry, addr_write) !=
+                      MMU_DATA_STORE * TARGET_LONG_SIZE);
+    QEMU_BUILD_BUG_ON(offsetof(CPUTLBEntry, addr_code) !=
+                      MMU_INST_FETCH * TARGET_LONG_SIZE);
+
+    const target_ulong *ptr = &entry->addr_idx[access_type];
+#if TCG_OVERSIZED_GUEST
+    return *ptr;
+#else
+    /* ptr might point to .addr_write, so use qatomic_read. */
+    return qatomic_read(ptr);
+#endif
+}
+
 static inline target_ulong tlb_addr_write(const CPUTLBEntry *entry)
 {
-#if TCG_OVERSIZED_GUEST
-    return entry->addr_write;
-#else
-    return qatomic_read(&entry->addr_write);
-#endif
+    return tlb_read_idx(entry, MMU_DATA_STORE);
 }
 
 /* Find the TLB index corresponding to the mmu_idx + address pair.  */
diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index 3117886af1..5051244c67 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -1441,34 +1441,17 @@ static void io_writex(CPUArchState *env, CPUTLBEntryFull *full,
     }
 }
 
-static inline target_ulong tlb_read_ofs(CPUTLBEntry *entry, size_t ofs)
-{
-#if TCG_OVERSIZED_GUEST
-    return *(target_ulong *)((uintptr_t)entry + ofs);
-#else
-    /* ofs might correspond to .addr_write, so use qatomic_read */
-    return qatomic_read((target_ulong *)((uintptr_t)entry + ofs));
-#endif
-}
-
 /* Return true if ADDR is present in the victim tlb, and has been copied
    back to the main tlb.  */
 static bool victim_tlb_hit(CPUArchState *env, size_t mmu_idx, size_t index,
-                           size_t elt_ofs, target_ulong page)
+                           MMUAccessType access_type, target_ulong page)
 {
     size_t vidx;
 
     assert_cpu_is_self(env_cpu(env));
     for (vidx = 0; vidx < CPU_VTLB_SIZE; ++vidx) {
         CPUTLBEntry *vtlb = &env_tlb(env)->d[mmu_idx].vtable[vidx];
-        target_ulong cmp;
-
-        /* elt_ofs might correspond to .addr_write, so use qatomic_read */
-#if TCG_OVERSIZED_GUEST
-        cmp = *(target_ulong *)((uintptr_t)vtlb + elt_ofs);
-#else
-        cmp = qatomic_read((target_ulong *)((uintptr_t)vtlb + elt_ofs));
-#endif
+        target_ulong cmp = tlb_read_idx(vtlb, access_type);
 
         if (cmp == page) {
             /* Found entry in victim tlb, swap tlb and iotlb.  */
@@ -1490,11 +1473,6 @@ static bool victim_tlb_hit(CPUArchState *env, size_t mmu_idx, size_t index,
     return false;
 }
 
-/* Macro to call the above, with local variables from the use context.  */
-#define VICTIM_TLB_HIT(TY, ADDR) \
-  victim_tlb_hit(env, mmu_idx, index, offsetof(CPUTLBEntry, TY), \
-                 (ADDR) & TARGET_PAGE_MASK)
-
 static void notdirty_write(CPUState *cpu, vaddr mem_vaddr, unsigned size,
                            CPUTLBEntryFull *full, uintptr_t retaddr)
 {
@@ -1527,29 +1505,12 @@ static int probe_access_internal(CPUArchState *env, target_ulong addr,
 {
     uintptr_t index = tlb_index(env, mmu_idx, addr);
     CPUTLBEntry *entry = tlb_entry(env, mmu_idx, addr);
-    target_ulong tlb_addr, page_addr;
-    size_t elt_ofs;
-    int flags;
+    target_ulong tlb_addr = tlb_read_idx(entry, access_type);
+    target_ulong page_addr = addr & TARGET_PAGE_MASK;
+    int flags = TLB_FLAGS_MASK;
 
-    switch (access_type) {
-    case MMU_DATA_LOAD:
-        elt_ofs = offsetof(CPUTLBEntry, addr_read);
-        break;
-    case MMU_DATA_STORE:
-        elt_ofs = offsetof(CPUTLBEntry, addr_write);
-        break;
-    case MMU_INST_FETCH:
-        elt_ofs = offsetof(CPUTLBEntry, addr_code);
-        break;
-    default:
-        g_assert_not_reached();
-    }
-    tlb_addr = tlb_read_ofs(entry, elt_ofs);
-
-    flags = TLB_FLAGS_MASK;
-    page_addr = addr & TARGET_PAGE_MASK;
     if (!tlb_hit_page(tlb_addr, page_addr)) {
-        if (!victim_tlb_hit(env, mmu_idx, index, elt_ofs, page_addr)) {
+        if (!victim_tlb_hit(env, mmu_idx, index, access_type, page_addr)) {
             CPUState *cs = env_cpu(env);
 
             if (!cs->cc->tcg_ops->tlb_fill(cs, addr, fault_size, access_type,
@@ -1571,7 +1532,7 @@ static int probe_access_internal(CPUArchState *env, target_ulong addr,
              */
             flags &= ~TLB_INVALID_MASK;
         }
-        tlb_addr = tlb_read_ofs(entry, elt_ofs);
+        tlb_addr = tlb_read_idx(entry, access_type);
     }
     flags &= tlb_addr;
 
@@ -1802,7 +1763,8 @@ static void *atomic_mmu_lookup(CPUArchState *env, target_ulong addr,
     if (prot & PAGE_WRITE) {
         tlb_addr = tlb_addr_write(tlbe);
         if (!tlb_hit(tlb_addr, addr)) {
-            if (!VICTIM_TLB_HIT(addr_write, addr)) {
+            if (!victim_tlb_hit(env, mmu_idx, index, MMU_DATA_STORE,
+                                addr & TARGET_PAGE_MASK)) {
                 tlb_fill(env_cpu(env), addr, size,
                          MMU_DATA_STORE, mmu_idx, retaddr);
                 index = tlb_index(env, mmu_idx, addr);
@@ -1835,7 +1797,8 @@ static void *atomic_mmu_lookup(CPUArchState *env, target_ulong addr,
     } else /* if (prot & PAGE_READ) */ {
         tlb_addr = tlbe->addr_read;
         if (!tlb_hit(tlb_addr, addr)) {
-            if (!VICTIM_TLB_HIT(addr_write, addr)) {
+            if (!victim_tlb_hit(env, mmu_idx, index, MMU_DATA_LOAD,
+                                addr & TARGET_PAGE_MASK)) {
                 tlb_fill(env_cpu(env), addr, size,
                          MMU_DATA_LOAD, mmu_idx, retaddr);
                 index = tlb_index(env, mmu_idx, addr);
@@ -1929,13 +1892,9 @@ load_memop(const void *haddr, MemOp op)
 
 static inline uint64_t QEMU_ALWAYS_INLINE
 load_helper(CPUArchState *env, target_ulong addr, MemOpIdx oi,
-            uintptr_t retaddr, MemOp op, bool code_read,
+            uintptr_t retaddr, MemOp op, MMUAccessType access_type,
             FullLoadHelper *full_load)
 {
-    const size_t tlb_off = code_read ?
-        offsetof(CPUTLBEntry, addr_code) : offsetof(CPUTLBEntry, addr_read);
-    const MMUAccessType access_type =
-        code_read ? MMU_INST_FETCH : MMU_DATA_LOAD;
     const unsigned a_bits = get_alignment_bits(get_memop(oi));
     const size_t size = memop_size(op);
     uintptr_t mmu_idx = get_mmuidx(oi);
@@ -1955,18 +1914,18 @@ load_helper(CPUArchState *env, target_ulong addr, MemOpIdx oi,
 
     index = tlb_index(env, mmu_idx, addr);
     entry = tlb_entry(env, mmu_idx, addr);
-    tlb_addr = code_read ? entry->addr_code : entry->addr_read;
+    tlb_addr = tlb_read_idx(entry, access_type);
 
     /* If the TLB entry is for a different page, reload and try again.  */
     if (!tlb_hit(tlb_addr, addr)) {
-        if (!victim_tlb_hit(env, mmu_idx, index, tlb_off,
+        if (!victim_tlb_hit(env, mmu_idx, index, access_type,
                             addr & TARGET_PAGE_MASK)) {
             tlb_fill(env_cpu(env), addr, size,
                      access_type, mmu_idx, retaddr);
             index = tlb_index(env, mmu_idx, addr);
             entry = tlb_entry(env, mmu_idx, addr);
         }
-        tlb_addr = code_read ? entry->addr_code : entry->addr_read;
+        tlb_addr = tlb_read_idx(entry, access_type);
         tlb_addr &= ~TLB_INVALID_MASK;
     }
 
@@ -2052,7 +2011,8 @@ static uint64_t full_ldub_mmu(CPUArchState *env, target_ulong addr,
                               MemOpIdx oi, uintptr_t retaddr)
 {
     validate_memop(oi, MO_UB);
-    return load_helper(env, addr, oi, retaddr, MO_UB, false, full_ldub_mmu);
+    return load_helper(env, addr, oi, retaddr, MO_UB, MMU_DATA_LOAD,
+                       full_ldub_mmu);
 }
 
 tcg_target_ulong helper_ret_ldub_mmu(CPUArchState *env, target_ulong addr,
@@ -2065,7 +2025,7 @@ static uint64_t full_le_lduw_mmu(CPUArchState *env, target_ulong addr,
                                  MemOpIdx oi, uintptr_t retaddr)
 {
     validate_memop(oi, MO_LEUW);
-    return load_helper(env, addr, oi, retaddr, MO_LEUW, false,
+    return load_helper(env, addr, oi, retaddr, MO_LEUW, MMU_DATA_LOAD,
                        full_le_lduw_mmu);
 }
 
@@ -2079,7 +2039,7 @@ static uint64_t full_be_lduw_mmu(CPUArchState *env, target_ulong addr,
                                  MemOpIdx oi, uintptr_t retaddr)
 {
     validate_memop(oi, MO_BEUW);
-    return load_helper(env, addr, oi, retaddr, MO_BEUW, false,
+    return load_helper(env, addr, oi, retaddr, MO_BEUW, MMU_DATA_LOAD,
                        full_be_lduw_mmu);
 }
 
@@ -2093,7 +2053,7 @@ static uint64_t full_le_ldul_mmu(CPUArchState *env, target_ulong addr,
                                  MemOpIdx oi, uintptr_t retaddr)
 {
     validate_memop(oi, MO_LEUL);
-    return load_helper(env, addr, oi, retaddr, MO_LEUL, false,
+    return load_helper(env, addr, oi, retaddr, MO_LEUL, MMU_DATA_LOAD,
                        full_le_ldul_mmu);
 }
 
@@ -2107,7 +2067,7 @@ static uint64_t full_be_ldul_mmu(CPUArchState *env, target_ulong addr,
                                  MemOpIdx oi, uintptr_t retaddr)
 {
     validate_memop(oi, MO_BEUL);
-    return load_helper(env, addr, oi, retaddr, MO_BEUL, false,
+    return load_helper(env, addr, oi, retaddr, MO_BEUL, MMU_DATA_LOAD,
                        full_be_ldul_mmu);
 }
 
@@ -2121,7 +2081,7 @@ uint64_t helper_le_ldq_mmu(CPUArchState *env, target_ulong addr,
                            MemOpIdx oi, uintptr_t retaddr)
 {
     validate_memop(oi, MO_LEUQ);
-    return load_helper(env, addr, oi, retaddr, MO_LEUQ, false,
+    return load_helper(env, addr, oi, retaddr, MO_LEUQ, MMU_DATA_LOAD,
                        helper_le_ldq_mmu);
 }
 
@@ -2129,7 +2089,7 @@ uint64_t helper_be_ldq_mmu(CPUArchState *env, target_ulong addr,
                            MemOpIdx oi, uintptr_t retaddr)
 {
     validate_memop(oi, MO_BEUQ);
-    return load_helper(env, addr, oi, retaddr, MO_BEUQ, false,
+    return load_helper(env, addr, oi, retaddr, MO_BEUQ, MMU_DATA_LOAD,
                        helper_be_ldq_mmu);
 }
 
@@ -2325,7 +2285,6 @@ store_helper_unaligned(CPUArchState *env, target_ulong addr, uint64_t val,
                        uintptr_t retaddr, size_t size, uintptr_t mmu_idx,
                        bool big_endian)
 {
-    const size_t tlb_off = offsetof(CPUTLBEntry, addr_write);
     uintptr_t index, index2;
     CPUTLBEntry *entry, *entry2;
     target_ulong page1, page2, tlb_addr, tlb_addr2;
@@ -2347,7 +2306,7 @@ store_helper_unaligned(CPUArchState *env, target_ulong addr, uint64_t val,
 
     tlb_addr2 = tlb_addr_write(entry2);
     if (page1 != page2 && !tlb_hit_page(tlb_addr2, page2)) {
-        if (!victim_tlb_hit(env, mmu_idx, index2, tlb_off, page2)) {
+        if (!victim_tlb_hit(env, mmu_idx, index2, MMU_DATA_STORE, page2)) {
             tlb_fill(env_cpu(env), page2, size2, MMU_DATA_STORE,
                      mmu_idx, retaddr);
             index2 = tlb_index(env, mmu_idx, page2);
@@ -2400,7 +2359,6 @@ static inline void QEMU_ALWAYS_INLINE
 store_helper(CPUArchState *env, target_ulong addr, uint64_t val,
              MemOpIdx oi, uintptr_t retaddr, MemOp op)
 {
-    const size_t tlb_off = offsetof(CPUTLBEntry, addr_write);
     const unsigned a_bits = get_alignment_bits(get_memop(oi));
     const size_t size = memop_size(op);
     uintptr_t mmu_idx = get_mmuidx(oi);
@@ -2423,7 +2381,7 @@ store_helper(CPUArchState *env, target_ulong addr, uint64_t val,
 
     /* If the TLB entry is for a different page, reload and try again.  */
     if (!tlb_hit(tlb_addr, addr)) {
-        if (!victim_tlb_hit(env, mmu_idx, index, tlb_off,
+        if (!victim_tlb_hit(env, mmu_idx, index, MMU_DATA_STORE,
             addr & TARGET_PAGE_MASK)) {
             tlb_fill(env_cpu(env), addr, size, MMU_DATA_STORE,
                      mmu_idx, retaddr);
@@ -2729,7 +2687,8 @@ void cpu_st16_le_mmu(CPUArchState *env, abi_ptr addr, Int128 val,
 static uint64_t full_ldub_code(CPUArchState *env, target_ulong addr,
                                MemOpIdx oi, uintptr_t retaddr)
 {
-    return load_helper(env, addr, oi, retaddr, MO_8, true, full_ldub_code);
+    return load_helper(env, addr, oi, retaddr, MO_8,
+                       MMU_INST_FETCH, full_ldub_code);
 }
 
 uint32_t cpu_ldub_code(CPUArchState *env, abi_ptr addr)
@@ -2741,7 +2700,8 @@ uint32_t cpu_ldub_code(CPUArchState *env, abi_ptr addr)
 static uint64_t full_lduw_code(CPUArchState *env, target_ulong addr,
                                MemOpIdx oi, uintptr_t retaddr)
 {
-    return load_helper(env, addr, oi, retaddr, MO_TEUW, true, full_lduw_code);
+    return load_helper(env, addr, oi, retaddr, MO_TEUW,
+                       MMU_INST_FETCH, full_lduw_code);
 }
 
 uint32_t cpu_lduw_code(CPUArchState *env, abi_ptr addr)
@@ -2753,7 +2713,8 @@ uint32_t cpu_lduw_code(CPUArchState *env, abi_ptr addr)
 static uint64_t full_ldl_code(CPUArchState *env, target_ulong addr,
                               MemOpIdx oi, uintptr_t retaddr)
 {
-    return load_helper(env, addr, oi, retaddr, MO_TEUL, true, full_ldl_code);
+    return load_helper(env, addr, oi, retaddr, MO_TEUL,
+                       MMU_INST_FETCH, full_ldl_code);
 }
 
 uint32_t cpu_ldl_code(CPUArchState *env, abi_ptr addr)
@@ -2765,7 +2726,8 @@ uint32_t cpu_ldl_code(CPUArchState *env, abi_ptr addr)
 static uint64_t full_ldq_code(CPUArchState *env, target_ulong addr,
                               MemOpIdx oi, uintptr_t retaddr)
 {
-    return load_helper(env, addr, oi, retaddr, MO_TEUQ, true, full_ldq_code);
+    return load_helper(env, addr, oi, retaddr, MO_TEUQ,
+                       MMU_INST_FETCH, full_ldq_code);
 }
 
 uint64_t cpu_ldq_code(CPUArchState *env, abi_ptr addr)
-- 
2.34.1




* [PATCH v4 04/57] accel/tcg: Reorg system mode load helpers
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
                   ` (2 preceding siblings ...)
  2023-05-03  7:06 ` [PATCH v4 03/57] accel/tcg: Introduce tlb_read_idx Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-04 15:39   ` Peter Maydell
  2023-05-03  7:06 ` [PATCH v4 05/57] accel/tcg: Reorg system mode store helpers Richard Henderson
                   ` (53 subsequent siblings)
  57 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel
  Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x, Alex Bennée

Instead of trying to unify all operations on uint64_t, pull out
mmu_lookup() to perform the basic tlb hit and resolution.
Create individual functions to handle access by size.
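
To make the cross-page path easier to follow, here is a standalone
sketch (illustration only) of the byte-wise big-endian accumulation
that do_ld_beN() performs, plus the final swap for a little-endian
MemOp; __builtin_bswap32 stands in for QEMU's bswap32():

  #include <stdint.h>
  #include <stdio.h>

  /* Accumulate @size bytes from @p in big-endian order onto @ret_be. */
  static uint64_t ld_bytes_beN(const uint8_t *p, int size, uint64_t ret_be)
  {
      for (int i = 0; i < size; i++) {
          ret_be = (ret_be << 8) | p[i];
      }
      return ret_be;
  }

  int main(void)
  {
      /* A 4-byte little-endian value split 3+1 across a page boundary. */
      const uint8_t page0[] = { 0x01, 0x02, 0x03 };
      const uint8_t page1[] = { 0x04 };

      uint64_t ret = ld_bytes_beN(page0, 3, 0);
      ret = ld_bytes_beN(page1, 1, ret);        /* 0x01020304, big-endian */
      printf("0x%08x\n", __builtin_bswap32((uint32_t)ret));  /* 0x04030201 */
      return 0;
  }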

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 accel/tcg/cputlb.c | 644 +++++++++++++++++++++++++++++----------------
 1 file changed, 423 insertions(+), 221 deletions(-)

diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index 5051244c67..dd68514260 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -1716,6 +1716,178 @@ bool tlb_plugin_lookup(CPUState *cpu, target_ulong addr, int mmu_idx,
 
 #endif
 
+/*
+ * Probe for a load/store operation.
+ * Return the host address and flags in MMULookupPageData.
+ */
+
+typedef struct MMULookupPageData {
+    CPUTLBEntryFull *full;
+    void *haddr;
+    target_ulong addr;
+    int flags;
+    int size;
+} MMULookupPageData;
+
+typedef struct MMULookupLocals {
+    MMULookupPageData page[2];
+    MemOp memop;
+    int mmu_idx;
+} MMULookupLocals;
+
+/**
+ * mmu_lookup1: translate one page
+ * @env: cpu context
+ * @data: lookup parameters
+ * @mmu_idx: virtual address context
+ * @access_type: load/store/code
+ * @ra: return address into tcg generated code, or 0
+ *
+ * Resolve the translation for the one page at @data.addr, filling in
+ * the rest of @data with the results.  If the translation fails,
+ * tlb_fill will longjmp out.  Return true if the softmmu tlb for
+ * @mmu_idx may have resized.
+ */
+static bool mmu_lookup1(CPUArchState *env, MMULookupPageData *data,
+                        int mmu_idx, MMUAccessType access_type, uintptr_t ra)
+{
+    target_ulong addr = data->addr;
+    uintptr_t index = tlb_index(env, mmu_idx, addr);
+    CPUTLBEntry *entry = tlb_entry(env, mmu_idx, addr);
+    target_ulong tlb_addr = tlb_read_idx(entry, access_type);
+    bool maybe_resized = false;
+
+    /* If the TLB entry is for a different page, reload and try again.  */
+    if (!tlb_hit(tlb_addr, addr)) {
+        if (!victim_tlb_hit(env, mmu_idx, index, access_type,
+                            addr & TARGET_PAGE_MASK)) {
+            tlb_fill(env_cpu(env), addr, data->size, access_type, mmu_idx, ra);
+            maybe_resized = true;
+            index = tlb_index(env, mmu_idx, addr);
+            entry = tlb_entry(env, mmu_idx, addr);
+        }
+        tlb_addr = tlb_read_idx(entry, access_type) & ~TLB_INVALID_MASK;
+    }
+
+    data->flags = tlb_addr & TLB_FLAGS_MASK;
+    data->full = &env_tlb(env)->d[mmu_idx].fulltlb[index];
+    /* Compute haddr speculatively; depending on flags it might be invalid. */
+    data->haddr = (void *)((uintptr_t)addr + entry->addend);
+
+    return maybe_resized;
+}
+
+/**
+ * mmu_watch_or_dirty
+ * @env: cpu context
+ * @data: lookup parameters
+ * @access_type: load/store/code
+ * @ra: return address into tcg generated code, or 0
+ *
+ * Trigger watchpoints for @data.addr:@data.size;
+ * record writes to protected clean pages.
+ */
+static void mmu_watch_or_dirty(CPUArchState *env, MMULookupPageData *data,
+                               MMUAccessType access_type, uintptr_t ra)
+{
+    CPUTLBEntryFull *full = data->full;
+    target_ulong addr = data->addr;
+    int flags = data->flags;
+    int size = data->size;
+
+    /* On watchpoint hit, this will longjmp out.  */
+    if (flags & TLB_WATCHPOINT) {
+        int wp = access_type == MMU_DATA_STORE ? BP_MEM_WRITE : BP_MEM_READ;
+        cpu_check_watchpoint(env_cpu(env), addr, size, full->attrs, wp, ra);
+        flags &= ~TLB_WATCHPOINT;
+    }
+
+    if (flags & TLB_NOTDIRTY) {
+        notdirty_write(env_cpu(env), addr, size, full, ra);
+        flags &= ~TLB_NOTDIRTY;
+    }
+    data->flags = flags;
+}
+
+/**
+ * mmu_lookup: translate page(s)
+ * @env: cpu context
+ * @addr: virtual address
+ * @oi: combined mmu_idx and MemOp
+ * @ra: return address into tcg generated code, or 0
+ * @access_type: load/store/code
+ * @l: output result
+ *
+ * Resolve the translation for the page(s) beginning at @addr, for MemOp.size
+ * bytes.  Return true if the lookup crosses a page boundary.
+ */
+static bool mmu_lookup(CPUArchState *env, target_ulong addr, MemOpIdx oi,
+                       uintptr_t ra, MMUAccessType type, MMULookupLocals *l)
+{
+    unsigned a_bits;
+    bool crosspage;
+    int flags;
+
+    l->memop = get_memop(oi);
+    l->mmu_idx = get_mmuidx(oi);
+
+    tcg_debug_assert(l->mmu_idx < NB_MMU_MODES);
+
+    /* Handle CPU specific unaligned behaviour */
+    a_bits = get_alignment_bits(l->memop);
+    if (addr & ((1 << a_bits) - 1)) {
+        cpu_unaligned_access(env_cpu(env), addr, type, l->mmu_idx, ra);
+    }
+
+    l->page[0].addr = addr;
+    l->page[0].size = memop_size(l->memop);
+    l->page[1].addr = (addr + l->page[0].size - 1) & TARGET_PAGE_MASK;
+    l->page[1].size = 0;
+    crosspage = (addr ^ l->page[1].addr) & TARGET_PAGE_MASK;
+
+    if (likely(!crosspage)) {
+        mmu_lookup1(env, &l->page[0], l->mmu_idx, type, ra);
+
+        flags = l->page[0].flags;
+        if (unlikely(flags & (TLB_WATCHPOINT | TLB_NOTDIRTY))) {
+            mmu_watch_or_dirty(env, &l->page[0], type, ra);
+        }
+        if (unlikely(flags & TLB_BSWAP)) {
+            l->memop ^= MO_BSWAP;
+        }
+    } else {
+        /* Finish compute of page crossing. */
+        int size1 = l->page[1].addr - addr;
+        l->page[1].size = l->page[0].size - size1;
+        l->page[0].size = size1;
+
+        /*
+         * Lookup both pages, recognizing exceptions from either.  If the
+         * second lookup potentially resized, refresh first CPUTLBEntryFull.
+         */
+        mmu_lookup1(env, &l->page[0], l->mmu_idx, type, ra);
+        if (mmu_lookup1(env, &l->page[1], l->mmu_idx, type, ra)) {
+            uintptr_t index = tlb_index(env, l->mmu_idx, addr);
+            l->page[0].full = &env_tlb(env)->d[l->mmu_idx].fulltlb[index];
+        }
+
+        flags = l->page[0].flags | l->page[1].flags;
+        if (unlikely(flags & (TLB_WATCHPOINT | TLB_NOTDIRTY))) {
+            mmu_watch_or_dirty(env, &l->page[0], type, ra);
+            mmu_watch_or_dirty(env, &l->page[1], type, ra);
+        }
+
+        /*
+         * Since target/sparc is the only user of TLB_BSWAP, and all
+         * Sparc accesses are aligned, any treatment across two pages
+         * would be arbitrary.  Refuse it until there's a use.
+         */
+        tcg_debug_assert((flags & TLB_BSWAP) == 0);
+    }
+
+    return crosspage;
+}
+
 /*
  * Probe for an atomic operation.  Do not allow unaligned operations,
  * or io operations to proceed.  Return the host address.
@@ -1890,113 +2062,6 @@ load_memop(const void *haddr, MemOp op)
     }
 }
 
-static inline uint64_t QEMU_ALWAYS_INLINE
-load_helper(CPUArchState *env, target_ulong addr, MemOpIdx oi,
-            uintptr_t retaddr, MemOp op, MMUAccessType access_type,
-            FullLoadHelper *full_load)
-{
-    const unsigned a_bits = get_alignment_bits(get_memop(oi));
-    const size_t size = memop_size(op);
-    uintptr_t mmu_idx = get_mmuidx(oi);
-    uintptr_t index;
-    CPUTLBEntry *entry;
-    target_ulong tlb_addr;
-    void *haddr;
-    uint64_t res;
-
-    tcg_debug_assert(mmu_idx < NB_MMU_MODES);
-
-    /* Handle CPU specific unaligned behaviour */
-    if (addr & ((1 << a_bits) - 1)) {
-        cpu_unaligned_access(env_cpu(env), addr, access_type,
-                             mmu_idx, retaddr);
-    }
-
-    index = tlb_index(env, mmu_idx, addr);
-    entry = tlb_entry(env, mmu_idx, addr);
-    tlb_addr = tlb_read_idx(entry, access_type);
-
-    /* If the TLB entry is for a different page, reload and try again.  */
-    if (!tlb_hit(tlb_addr, addr)) {
-        if (!victim_tlb_hit(env, mmu_idx, index, access_type,
-                            addr & TARGET_PAGE_MASK)) {
-            tlb_fill(env_cpu(env), addr, size,
-                     access_type, mmu_idx, retaddr);
-            index = tlb_index(env, mmu_idx, addr);
-            entry = tlb_entry(env, mmu_idx, addr);
-        }
-        tlb_addr = tlb_read_idx(entry, access_type);
-        tlb_addr &= ~TLB_INVALID_MASK;
-    }
-
-    /* Handle anything that isn't just a straight memory access.  */
-    if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
-        CPUTLBEntryFull *full;
-        bool need_swap;
-
-        /* For anything that is unaligned, recurse through full_load.  */
-        if ((addr & (size - 1)) != 0) {
-            goto do_unaligned_access;
-        }
-
-        full = &env_tlb(env)->d[mmu_idx].fulltlb[index];
-
-        /* Handle watchpoints.  */
-        if (unlikely(tlb_addr & TLB_WATCHPOINT)) {
-            /* On watchpoint hit, this will longjmp out.  */
-            cpu_check_watchpoint(env_cpu(env), addr, size,
-                                 full->attrs, BP_MEM_READ, retaddr);
-        }
-
-        need_swap = size > 1 && (tlb_addr & TLB_BSWAP);
-
-        /* Handle I/O access.  */
-        if (likely(tlb_addr & TLB_MMIO)) {
-            return io_readx(env, full, mmu_idx, addr, retaddr,
-                            access_type, op ^ (need_swap * MO_BSWAP));
-        }
-
-        haddr = (void *)((uintptr_t)addr + entry->addend);
-
-        /*
-         * Keep these two load_memop separate to ensure that the compiler
-         * is able to fold the entire function to a single instruction.
-         * There is a build-time assert inside to remind you of this.  ;-)
-         */
-        if (unlikely(need_swap)) {
-            return load_memop(haddr, op ^ MO_BSWAP);
-        }
-        return load_memop(haddr, op);
-    }
-
-    /* Handle slow unaligned access (it spans two pages or IO).  */
-    if (size > 1
-        && unlikely((addr & ~TARGET_PAGE_MASK) + size - 1
-                    >= TARGET_PAGE_SIZE)) {
-        target_ulong addr1, addr2;
-        uint64_t r1, r2;
-        unsigned shift;
-    do_unaligned_access:
-        addr1 = addr & ~((target_ulong)size - 1);
-        addr2 = addr1 + size;
-        r1 = full_load(env, addr1, oi, retaddr);
-        r2 = full_load(env, addr2, oi, retaddr);
-        shift = (addr & (size - 1)) * 8;
-
-        if (memop_big_endian(op)) {
-            /* Big-endian combine.  */
-            res = (r1 << shift) | (r2 >> ((size * 8) - shift));
-        } else {
-            /* Little-endian combine.  */
-            res = (r1 >> shift) | (r2 << ((size * 8) - shift));
-        }
-        return res & MAKE_64BIT_MASK(0, size * 8);
-    }
-
-    haddr = (void *)((uintptr_t)addr + entry->addend);
-    return load_memop(haddr, op);
-}
-
 /*
  * For the benefit of TCG generated code, we want to avoid the
  * complication of ABI-specific return type promotion and always
@@ -2007,90 +2072,250 @@ load_helper(CPUArchState *env, target_ulong addr, MemOpIdx oi,
  * We don't bother with this widened value for SOFTMMU_CODE_ACCESS.
  */
 
-static uint64_t full_ldub_mmu(CPUArchState *env, target_ulong addr,
-                              MemOpIdx oi, uintptr_t retaddr)
+/**
+ * do_ld_mmio_beN:
+ * @env: cpu context
+ * @p: translation parameters
+ * @ret_be: accumulated data
+ * @mmu_idx: virtual address context
+ * @ra: return address into tcg generated code, or 0
+ *
+ * Load @p->size bytes from @p->addr, which is memory-mapped i/o.
+ * The bytes are concatenated in big-endian order with @ret_be.
+ */
+static uint64_t do_ld_mmio_beN(CPUArchState *env, MMULookupPageData *p,
+                               uint64_t ret_be, int mmu_idx,
+                               MMUAccessType type, uintptr_t ra)
 {
-    validate_memop(oi, MO_UB);
-    return load_helper(env, addr, oi, retaddr, MO_UB, MMU_DATA_LOAD,
-                       full_ldub_mmu);
+    CPUTLBEntryFull *full = p->full;
+    target_ulong addr = p->addr;
+    int i, size = p->size;
+
+    QEMU_IOTHREAD_LOCK_GUARD();
+    for (i = 0; i < size; i++) {
+        uint8_t x = io_readx(env, full, mmu_idx, addr + i, ra, type, MO_UB);
+        ret_be = (ret_be << 8) | x;
+    }
+    return ret_be;
+}
+
+/**
+ * do_ld_bytes_beN
+ * @p: translation parameters
+ * @ret_be: accumulated data
+ *
+ * Load @p->size bytes from @p->haddr, which is RAM.
+ * The bytes are concatenated in big-endian order with @ret_be.
+ */
+static uint64_t do_ld_bytes_beN(MMULookupPageData *p, uint64_t ret_be)
+{
+    uint8_t *haddr = p->haddr;
+    int i, size = p->size;
+
+    for (i = 0; i < size; i++) {
+        ret_be = (ret_be << 8) | haddr[i];
+    }
+    return ret_be;
+}
+
+/*
+ * Wrapper for the above.
+ */
+static uint64_t do_ld_beN(CPUArchState *env, MMULookupPageData *p,
+                          uint64_t ret_be, int mmu_idx,
+                          MMUAccessType type, uintptr_t ra)
+{
+    if (unlikely(p->flags & TLB_MMIO)) {
+        return do_ld_mmio_beN(env, p, ret_be, mmu_idx, type, ra);
+    } else {
+        return do_ld_bytes_beN(p, ret_be);
+    }
+}
+
+static uint8_t do_ld_1(CPUArchState *env, MMULookupPageData *p, int mmu_idx,
+                       MMUAccessType type, uintptr_t ra)
+{
+    if (unlikely(p->flags & TLB_MMIO)) {
+        return io_readx(env, p->full, mmu_idx, p->addr, ra, type, MO_UB);
+    } else {
+        return *(uint8_t *)p->haddr;
+    }
+}
+
+static uint16_t do_ld_2(CPUArchState *env, MMULookupPageData *p, int mmu_idx,
+                        MMUAccessType type, MemOp memop, uintptr_t ra)
+{
+    uint64_t ret;
+
+    if (unlikely(p->flags & TLB_MMIO)) {
+        return io_readx(env, p->full, mmu_idx, p->addr, ra, type, memop);
+    }
+
+    /* Perform the load host endian, then swap if necessary. */
+    ret = load_memop(p->haddr, MO_UW);
+    if (memop & MO_BSWAP) {
+        ret = bswap16(ret);
+    }
+    return ret;
+}
+
+static uint32_t do_ld_4(CPUArchState *env, MMULookupPageData *p, int mmu_idx,
+                        MMUAccessType type, MemOp memop, uintptr_t ra)
+{
+    uint32_t ret;
+
+    if (unlikely(p->flags & TLB_MMIO)) {
+        return io_readx(env, p->full, mmu_idx, p->addr, ra, type, memop);
+    }
+
+    /* Perform the load host endian. */
+    ret = load_memop(p->haddr, MO_UL);
+    if (memop & MO_BSWAP) {
+        ret = bswap32(ret);
+    }
+    return ret;
+}
+
+static uint64_t do_ld_8(CPUArchState *env, MMULookupPageData *p, int mmu_idx,
+                        MMUAccessType type, MemOp memop, uintptr_t ra)
+{
+    uint64_t ret;
+
+    if (unlikely(p->flags & TLB_MMIO)) {
+        return io_readx(env, p->full, mmu_idx, p->addr, ra, type, memop);
+    }
+
+    /* Perform the load host endian. */
+    ret = load_memop(p->haddr, MO_UQ);
+    if (memop & MO_BSWAP) {
+        ret = bswap64(ret);
+    }
+    return ret;
+}
+
+static uint8_t do_ld1_mmu(CPUArchState *env, target_ulong addr, MemOpIdx oi,
+                          uintptr_t ra, MMUAccessType access_type)
+{
+    MMULookupLocals l;
+    bool crosspage;
+
+    crosspage = mmu_lookup(env, addr, oi, ra, access_type, &l);
+    tcg_debug_assert(!crosspage);
+
+    return do_ld_1(env, &l.page[0], l.mmu_idx, access_type, ra);
 }
 
 tcg_target_ulong helper_ret_ldub_mmu(CPUArchState *env, target_ulong addr,
                                      MemOpIdx oi, uintptr_t retaddr)
 {
-    return full_ldub_mmu(env, addr, oi, retaddr);
+    validate_memop(oi, MO_UB);
+    return do_ld1_mmu(env, addr, oi, retaddr, MMU_DATA_LOAD);
 }
 
-static uint64_t full_le_lduw_mmu(CPUArchState *env, target_ulong addr,
-                                 MemOpIdx oi, uintptr_t retaddr)
+static uint16_t do_ld2_mmu(CPUArchState *env, target_ulong addr, MemOpIdx oi,
+                           uintptr_t ra, MMUAccessType access_type)
 {
-    validate_memop(oi, MO_LEUW);
-    return load_helper(env, addr, oi, retaddr, MO_LEUW, MMU_DATA_LOAD,
-                       full_le_lduw_mmu);
+    MMULookupLocals l;
+    bool crosspage;
+    uint16_t ret;
+    uint8_t a, b;
+
+    crosspage = mmu_lookup(env, addr, oi, ra, access_type, &l);
+    if (likely(!crosspage)) {
+        return do_ld_2(env, &l.page[0], l.mmu_idx, access_type, l.memop, ra);
+    }
+
+    a = do_ld_1(env, &l.page[0], l.mmu_idx, access_type, ra);
+    b = do_ld_1(env, &l.page[1], l.mmu_idx, access_type, ra);
+
+    if ((l.memop & MO_BSWAP) == MO_LE) {
+        ret = a | (b << 8);
+    } else {
+        ret = b | (a << 8);
+    }
+    return ret;
 }
 
 tcg_target_ulong helper_le_lduw_mmu(CPUArchState *env, target_ulong addr,
                                     MemOpIdx oi, uintptr_t retaddr)
 {
-    return full_le_lduw_mmu(env, addr, oi, retaddr);
-}
-
-static uint64_t full_be_lduw_mmu(CPUArchState *env, target_ulong addr,
-                                 MemOpIdx oi, uintptr_t retaddr)
-{
-    validate_memop(oi, MO_BEUW);
-    return load_helper(env, addr, oi, retaddr, MO_BEUW, MMU_DATA_LOAD,
-                       full_be_lduw_mmu);
+    validate_memop(oi, MO_LEUW);
+    return do_ld2_mmu(env, addr, oi, retaddr, MMU_DATA_LOAD);
 }
 
 tcg_target_ulong helper_be_lduw_mmu(CPUArchState *env, target_ulong addr,
                                     MemOpIdx oi, uintptr_t retaddr)
 {
-    return full_be_lduw_mmu(env, addr, oi, retaddr);
+    validate_memop(oi, MO_BEUW);
+    return do_ld2_mmu(env, addr, oi, retaddr, MMU_DATA_LOAD);
 }
 
-static uint64_t full_le_ldul_mmu(CPUArchState *env, target_ulong addr,
-                                 MemOpIdx oi, uintptr_t retaddr)
+static uint32_t do_ld4_mmu(CPUArchState *env, target_ulong addr, MemOpIdx oi,
+                           uintptr_t ra, MMUAccessType access_type)
 {
-    validate_memop(oi, MO_LEUL);
-    return load_helper(env, addr, oi, retaddr, MO_LEUL, MMU_DATA_LOAD,
-                       full_le_ldul_mmu);
+    MMULookupLocals l;
+    bool crosspage;
+    uint32_t ret;
+
+    crosspage = mmu_lookup(env, addr, oi, ra, access_type, &l);
+    if (likely(!crosspage)) {
+        return do_ld_4(env, &l.page[0], l.mmu_idx, access_type, l.memop, ra);
+    }
+
+    ret = do_ld_beN(env, &l.page[0], 0, l.mmu_idx, access_type, ra);
+    ret = do_ld_beN(env, &l.page[1], ret, l.mmu_idx, access_type, ra);
+    if ((l.memop & MO_BSWAP) == MO_LE) {
+        ret = bswap32(ret);
+    }
+    return ret;
 }
 
 tcg_target_ulong helper_le_ldul_mmu(CPUArchState *env, target_ulong addr,
                                     MemOpIdx oi, uintptr_t retaddr)
 {
-    return full_le_ldul_mmu(env, addr, oi, retaddr);
-}
-
-static uint64_t full_be_ldul_mmu(CPUArchState *env, target_ulong addr,
-                                 MemOpIdx oi, uintptr_t retaddr)
-{
-    validate_memop(oi, MO_BEUL);
-    return load_helper(env, addr, oi, retaddr, MO_BEUL, MMU_DATA_LOAD,
-                       full_be_ldul_mmu);
+    validate_memop(oi, MO_LEUL);
+    return do_ld4_mmu(env, addr, oi, retaddr, MMU_DATA_LOAD);
 }
 
 tcg_target_ulong helper_be_ldul_mmu(CPUArchState *env, target_ulong addr,
                                     MemOpIdx oi, uintptr_t retaddr)
 {
-    return full_be_ldul_mmu(env, addr, oi, retaddr);
+    validate_memop(oi, MO_BEUL);
+    return do_ld4_mmu(env, addr, oi, retaddr, MMU_DATA_LOAD);
+}
+
+static uint64_t do_ld8_mmu(CPUArchState *env, target_ulong addr, MemOpIdx oi,
+                           uintptr_t ra, MMUAccessType access_type)
+{
+    MMULookupLocals l;
+    bool crosspage;
+    uint64_t ret;
+
+    crosspage = mmu_lookup(env, addr, oi, ra, access_type, &l);
+    if (likely(!crosspage)) {
+        return do_ld_8(env, &l.page[0], l.mmu_idx, access_type, l.memop, ra);
+    }
+
+    ret = do_ld_beN(env, &l.page[0], 0, l.mmu_idx, access_type, ra);
+    ret = do_ld_beN(env, &l.page[1], ret, l.mmu_idx, access_type, ra);
+    if ((l.memop & MO_BSWAP) == MO_LE) {
+        ret = bswap64(ret);
+    }
+    return ret;
 }
 
 uint64_t helper_le_ldq_mmu(CPUArchState *env, target_ulong addr,
                            MemOpIdx oi, uintptr_t retaddr)
 {
     validate_memop(oi, MO_LEUQ);
-    return load_helper(env, addr, oi, retaddr, MO_LEUQ, MMU_DATA_LOAD,
-                       helper_le_ldq_mmu);
+    return do_ld8_mmu(env, addr, oi, retaddr, MMU_DATA_LOAD);
 }
 
 uint64_t helper_be_ldq_mmu(CPUArchState *env, target_ulong addr,
                            MemOpIdx oi, uintptr_t retaddr)
 {
     validate_memop(oi, MO_BEUQ);
-    return load_helper(env, addr, oi, retaddr, MO_BEUQ, MMU_DATA_LOAD,
-                       helper_be_ldq_mmu);
+    return do_ld8_mmu(env, addr, oi, retaddr, MMU_DATA_LOAD);
 }
 
 /*
@@ -2133,56 +2358,85 @@ tcg_target_ulong helper_be_ldsl_mmu(CPUArchState *env, target_ulong addr,
  * Load helpers for cpu_ldst.h.
  */
 
-static inline uint64_t cpu_load_helper(CPUArchState *env, abi_ptr addr,
-                                       MemOpIdx oi, uintptr_t retaddr,
-                                       FullLoadHelper *full_load)
+static void plugin_load_cb(CPUArchState *env, abi_ptr addr, MemOpIdx oi)
 {
-    uint64_t ret;
-
-    ret = full_load(env, addr, oi, retaddr);
     qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_R);
-    return ret;
 }
 
 uint8_t cpu_ldb_mmu(CPUArchState *env, abi_ptr addr, MemOpIdx oi, uintptr_t ra)
 {
-    return cpu_load_helper(env, addr, oi, ra, full_ldub_mmu);
+    uint8_t ret;
+
+    validate_memop(oi, MO_UB);
+    ret = do_ld1_mmu(env, addr, oi, ra, MMU_DATA_LOAD);
+    plugin_load_cb(env, addr, oi);
+    return ret;
 }
 
 uint16_t cpu_ldw_be_mmu(CPUArchState *env, abi_ptr addr,
                         MemOpIdx oi, uintptr_t ra)
 {
-    return cpu_load_helper(env, addr, oi, ra, full_be_lduw_mmu);
+    uint16_t ret;
+
+    validate_memop(oi, MO_BEUW);
+    ret = do_ld2_mmu(env, addr, oi, ra, MMU_DATA_LOAD);
+    plugin_load_cb(env, addr, oi);
+    return ret;
 }
 
 uint32_t cpu_ldl_be_mmu(CPUArchState *env, abi_ptr addr,
                         MemOpIdx oi, uintptr_t ra)
 {
-    return cpu_load_helper(env, addr, oi, ra, full_be_ldul_mmu);
+    uint32_t ret;
+
+    validate_memop(oi, MO_BEUL);
+    ret = do_ld4_mmu(env, addr, oi, ra, MMU_DATA_LOAD);
+    plugin_load_cb(env, addr, oi);
+    return ret;
 }
 
 uint64_t cpu_ldq_be_mmu(CPUArchState *env, abi_ptr addr,
                         MemOpIdx oi, uintptr_t ra)
 {
-    return cpu_load_helper(env, addr, oi, ra, helper_be_ldq_mmu);
+    uint64_t ret;
+
+    validate_memop(oi, MO_BEUQ);
+    ret = do_ld8_mmu(env, addr, oi, ra, MMU_DATA_LOAD);
+    plugin_load_cb(env, addr, oi);
+    return ret;
 }
 
 uint16_t cpu_ldw_le_mmu(CPUArchState *env, abi_ptr addr,
                         MemOpIdx oi, uintptr_t ra)
 {
-    return cpu_load_helper(env, addr, oi, ra, full_le_lduw_mmu);
+    uint16_t ret;
+
+    validate_memop(oi, MO_LEUW);
+    ret = do_ld2_mmu(env, addr, oi, ra, MMU_DATA_LOAD);
+    plugin_load_cb(env, addr, oi);
+    return ret;
 }
 
 uint32_t cpu_ldl_le_mmu(CPUArchState *env, abi_ptr addr,
                         MemOpIdx oi, uintptr_t ra)
 {
-    return cpu_load_helper(env, addr, oi, ra, full_le_ldul_mmu);
+    uint32_t ret;
+
+    validate_memop(oi, MO_LEUL);
+    ret = do_ld4_mmu(env, addr, oi, ra, MMU_DATA_LOAD);
+    plugin_load_cb(env, addr, oi);
+    return ret;
 }
 
 uint64_t cpu_ldq_le_mmu(CPUArchState *env, abi_ptr addr,
                         MemOpIdx oi, uintptr_t ra)
 {
-    return cpu_load_helper(env, addr, oi, ra, helper_le_ldq_mmu);
+    uint64_t ret;
+
+    validate_memop(oi, MO_LEUQ);
+    ret = do_ld8_mmu(env, addr, oi, ra, MMU_DATA_LOAD);
+    plugin_load_cb(env, addr, oi);
+    return ret;
 }
 
 Int128 cpu_ld16_be_mmu(CPUArchState *env, abi_ptr addr,
@@ -2684,102 +2938,50 @@ void cpu_st16_le_mmu(CPUArchState *env, abi_ptr addr, Int128 val,
 
 /* Code access functions.  */
 
-static uint64_t full_ldub_code(CPUArchState *env, target_ulong addr,
-                               MemOpIdx oi, uintptr_t retaddr)
-{
-    return load_helper(env, addr, oi, retaddr, MO_8,
-                       MMU_INST_FETCH, full_ldub_code);
-}
-
 uint32_t cpu_ldub_code(CPUArchState *env, abi_ptr addr)
 {
     MemOpIdx oi = make_memop_idx(MO_UB, cpu_mmu_index(env, true));
-    return full_ldub_code(env, addr, oi, 0);
-}
-
-static uint64_t full_lduw_code(CPUArchState *env, target_ulong addr,
-                               MemOpIdx oi, uintptr_t retaddr)
-{
-    return load_helper(env, addr, oi, retaddr, MO_TEUW,
-                       MMU_INST_FETCH, full_lduw_code);
+    return do_ld1_mmu(env, addr, oi, 0, MMU_INST_FETCH);
 }
 
 uint32_t cpu_lduw_code(CPUArchState *env, abi_ptr addr)
 {
     MemOpIdx oi = make_memop_idx(MO_TEUW, cpu_mmu_index(env, true));
-    return full_lduw_code(env, addr, oi, 0);
-}
-
-static uint64_t full_ldl_code(CPUArchState *env, target_ulong addr,
-                              MemOpIdx oi, uintptr_t retaddr)
-{
-    return load_helper(env, addr, oi, retaddr, MO_TEUL,
-                       MMU_INST_FETCH, full_ldl_code);
+    return do_ld2_mmu(env, addr, oi, 0, MMU_INST_FETCH);
 }
 
 uint32_t cpu_ldl_code(CPUArchState *env, abi_ptr addr)
 {
     MemOpIdx oi = make_memop_idx(MO_TEUL, cpu_mmu_index(env, true));
-    return full_ldl_code(env, addr, oi, 0);
-}
-
-static uint64_t full_ldq_code(CPUArchState *env, target_ulong addr,
-                              MemOpIdx oi, uintptr_t retaddr)
-{
-    return load_helper(env, addr, oi, retaddr, MO_TEUQ,
-                       MMU_INST_FETCH, full_ldq_code);
+    return do_ld4_mmu(env, addr, oi, 0, MMU_INST_FETCH);
 }
 
 uint64_t cpu_ldq_code(CPUArchState *env, abi_ptr addr)
 {
     MemOpIdx oi = make_memop_idx(MO_TEUQ, cpu_mmu_index(env, true));
-    return full_ldq_code(env, addr, oi, 0);
+    return do_ld8_mmu(env, addr, oi, 0, MMU_INST_FETCH);
 }
 
 uint8_t cpu_ldb_code_mmu(CPUArchState *env, abi_ptr addr,
                          MemOpIdx oi, uintptr_t retaddr)
 {
-    return full_ldub_code(env, addr, oi, retaddr);
+    return do_ld1_mmu(env, addr, oi, retaddr, MMU_INST_FETCH);
 }
 
 uint16_t cpu_ldw_code_mmu(CPUArchState *env, abi_ptr addr,
                           MemOpIdx oi, uintptr_t retaddr)
 {
-    MemOp mop = get_memop(oi);
-    int idx = get_mmuidx(oi);
-    uint16_t ret;
-
-    ret = full_lduw_code(env, addr, make_memop_idx(MO_TEUW, idx), retaddr);
-    if ((mop & MO_BSWAP) != MO_TE) {
-        ret = bswap16(ret);
-    }
-    return ret;
+    return do_ld2_mmu(env, addr, oi, retaddr, MMU_INST_FETCH);
 }
 
 uint32_t cpu_ldl_code_mmu(CPUArchState *env, abi_ptr addr,
                           MemOpIdx oi, uintptr_t retaddr)
 {
-    MemOp mop = get_memop(oi);
-    int idx = get_mmuidx(oi);
-    uint32_t ret;
-
-    ret = full_ldl_code(env, addr, make_memop_idx(MO_TEUL, idx), retaddr);
-    if ((mop & MO_BSWAP) != MO_TE) {
-        ret = bswap32(ret);
-    }
-    return ret;
+    return do_ld4_mmu(env, addr, oi, retaddr, MMU_INST_FETCH);
 }
 
 uint64_t cpu_ldq_code_mmu(CPUArchState *env, abi_ptr addr,
                           MemOpIdx oi, uintptr_t retaddr)
 {
-    MemOp mop = get_memop(oi);
-    int idx = get_mmuidx(oi);
-    uint64_t ret;
-
-    ret = full_ldq_code(env, addr, make_memop_idx(MO_TEUQ, idx), retaddr);
-    if ((mop & MO_BSWAP) != MO_TE) {
-        ret = bswap64(ret);
-    }
-    return ret;
+    return do_ld8_mmu(env, addr, oi, retaddr, MMU_INST_FETCH);
 }
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH v4 05/57] accel/tcg: Reorg system mode store helpers
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
                   ` (3 preceding siblings ...)
  2023-05-03  7:06 ` [PATCH v4 04/57] accel/tcg: Reorg system mode load helpers Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-04 15:44   ` Peter Maydell
  2023-05-03  7:06 ` [PATCH v4 06/57] accel/tcg: Honor atomicity of loads Richard Henderson
                   ` (52 subsequent siblings)
  57 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel; +Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

Instead of trying to unify all operations on uint64_t, use
mmu_lookup() to perform the basic tlb hit and resolution.
Create individual functions to handle access by size.
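
Each of the new helpers has the same shape; as a condensed sketch, using the
2-byte store as the example (all names are as in the diff below):

    static void do_st2_mmu(CPUArchState *env, target_ulong addr, uint16_t val,
                           MemOpIdx oi, uintptr_t ra)
    {
        MMULookupLocals l;
        bool crosspage;
        uint8_t a, b;

        /* Resolve the TLB for both pages touched by [addr, addr + 1]. */
        crosspage = mmu_lookup(env, addr, oi, ra, MMU_DATA_STORE, &l);
        if (likely(!crosspage)) {
            /* Both bytes on one page: a single sized store via do_st_2(). */
            do_st_2(env, &l.page[0], val, l.mmu_idx, l.memop, ra);
            return;
        }

        /* Page-crossing: split into two byte stores in guest memory order. */
        if ((l.memop & MO_BSWAP) == MO_LE) {
            a = val, b = val >> 8;
        } else {
            b = val, a = val >> 8;
        }
        do_st_1(env, &l.page[0], a, l.mmu_idx, ra);
        do_st_1(env, &l.page[1], b, l.mmu_idx, ra);
    }

The 4- and 8-byte helpers differ only in the page-crossing path, which
byte-swaps the value to little-endian and hands it to do_st_leN() so that
each page fragment can be stored bytewise.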

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 accel/tcg/cputlb.c | 408 +++++++++++++++++++++------------------------
 1 file changed, 193 insertions(+), 215 deletions(-)

diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index dd68514260..f52c7e6da0 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -2531,322 +2531,300 @@ store_memop(void *haddr, uint64_t val, MemOp op)
     }
 }
 
-static void full_stb_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
-                         MemOpIdx oi, uintptr_t retaddr);
-
-static void __attribute__((noinline))
-store_helper_unaligned(CPUArchState *env, target_ulong addr, uint64_t val,
-                       uintptr_t retaddr, size_t size, uintptr_t mmu_idx,
-                       bool big_endian)
+/**
+ * do_st_mmio_leN:
+ * @env: cpu context
+ * @p: translation parameters
+ * @val_le: data to store
+ * @mmu_idx: virtual address context
+ * @ra: return address into tcg generated code, or 0
+ *
+ * Store @p->size bytes at @p->addr, which is memory-mapped i/o.
+ * The bytes to store are extracted in little-endian order from @val_le;
+ * return the bytes of @val_le beyond @p->size that have not been stored.
+ */
+static uint64_t do_st_mmio_leN(CPUArchState *env, MMULookupPageData *p,
+                               uint64_t val_le, int mmu_idx, uintptr_t ra)
 {
-    uintptr_t index, index2;
-    CPUTLBEntry *entry, *entry2;
-    target_ulong page1, page2, tlb_addr, tlb_addr2;
-    MemOpIdx oi;
-    size_t size2;
-    int i;
+    CPUTLBEntryFull *full = p->full;
+    target_ulong addr = p->addr;
+    int i, size = p->size;
 
-    /*
-     * Ensure the second page is in the TLB.  Note that the first page
-     * is already guaranteed to be filled, and that the second page
-     * cannot evict the first.  An exception to this rule is PAGE_WRITE_INV
-     * handling: the first page could have evicted itself.
-     */
-    page1 = addr & TARGET_PAGE_MASK;
-    page2 = (addr + size) & TARGET_PAGE_MASK;
-    size2 = (addr + size) & ~TARGET_PAGE_MASK;
-    index2 = tlb_index(env, mmu_idx, page2);
-    entry2 = tlb_entry(env, mmu_idx, page2);
-
-    tlb_addr2 = tlb_addr_write(entry2);
-    if (page1 != page2 && !tlb_hit_page(tlb_addr2, page2)) {
-        if (!victim_tlb_hit(env, mmu_idx, index2, MMU_DATA_STORE, page2)) {
-            tlb_fill(env_cpu(env), page2, size2, MMU_DATA_STORE,
-                     mmu_idx, retaddr);
-            index2 = tlb_index(env, mmu_idx, page2);
-            entry2 = tlb_entry(env, mmu_idx, page2);
-        }
-        tlb_addr2 = tlb_addr_write(entry2);
+    QEMU_IOTHREAD_LOCK_GUARD();
+    for (i = 0; i < size; i++, val_le >>= 8) {
+        io_writex(env, full, mmu_idx, val_le, addr + i, ra, MO_UB);
     }
+    return val_le;
+}
 
-    index = tlb_index(env, mmu_idx, addr);
-    entry = tlb_entry(env, mmu_idx, addr);
-    tlb_addr = tlb_addr_write(entry);
+/**
+ * do_st_bytes_leN:
+ * @p: translation parameters
+ * @val_le: data to store
+ *
+ * Store @p->size bytes at @p->haddr, which is RAM.
+ * The bytes to store are extracted in little-endian order from @val_le;
+ * return the bytes of @val_le beyond @p->size that have not been stored.
+ */
+static uint64_t do_st_bytes_leN(MMULookupPageData *p, uint64_t val_le)
+{
+    uint8_t *haddr = p->haddr;
+    int i, size = p->size;
 
-    /*
-     * Handle watchpoints.  Since this may trap, all checks
-     * must happen before any store.
-     */
-    if (unlikely(tlb_addr & TLB_WATCHPOINT)) {
-        cpu_check_watchpoint(env_cpu(env), addr, size - size2,
-                             env_tlb(env)->d[mmu_idx].fulltlb[index].attrs,
-                             BP_MEM_WRITE, retaddr);
-    }
-    if (unlikely(tlb_addr2 & TLB_WATCHPOINT)) {
-        cpu_check_watchpoint(env_cpu(env), page2, size2,
-                             env_tlb(env)->d[mmu_idx].fulltlb[index2].attrs,
-                             BP_MEM_WRITE, retaddr);
+    for (i = 0; i < size; i++, val_le >>= 8) {
+        haddr[i] = val_le;
     }
+    return val_le;
+}
 
-    /*
-     * XXX: not efficient, but simple.
-     * This loop must go in the forward direction to avoid issues
-     * with self-modifying code in Windows 64-bit.
-     */
-    oi = make_memop_idx(MO_UB, mmu_idx);
-    if (big_endian) {
-        for (i = 0; i < size; ++i) {
-            /* Big-endian extract.  */
-            uint8_t val8 = val >> (((size - 1) * 8) - (i * 8));
-            full_stb_mmu(env, addr + i, val8, oi, retaddr);
-        }
+/*
+ * Wrapper for the above.
+ */
+static uint64_t do_st_leN(CPUArchState *env, MMULookupPageData *p,
+                          uint64_t val_le, int mmu_idx, uintptr_t ra)
+{
+    if (unlikely(p->flags & TLB_MMIO)) {
+        return do_st_mmio_leN(env, p, val_le, mmu_idx, ra);
+    } else if (unlikely(p->flags & TLB_DISCARD_WRITE)) {
+        return val_le >> (p->size * 8);
     } else {
-        for (i = 0; i < size; ++i) {
-            /* Little-endian extract.  */
-            uint8_t val8 = val >> (i * 8);
-            full_stb_mmu(env, addr + i, val8, oi, retaddr);
-        }
+        return do_st_bytes_leN(p, val_le);
     }
 }
 
-static inline void QEMU_ALWAYS_INLINE
-store_helper(CPUArchState *env, target_ulong addr, uint64_t val,
-             MemOpIdx oi, uintptr_t retaddr, MemOp op)
+static void do_st_1(CPUArchState *env, MMULookupPageData *p, uint8_t val,
+                    int mmu_idx, uintptr_t ra)
 {
-    const unsigned a_bits = get_alignment_bits(get_memop(oi));
-    const size_t size = memop_size(op);
-    uintptr_t mmu_idx = get_mmuidx(oi);
-    uintptr_t index;
-    CPUTLBEntry *entry;
-    target_ulong tlb_addr;
-    void *haddr;
-
-    tcg_debug_assert(mmu_idx < NB_MMU_MODES);
-
-    /* Handle CPU specific unaligned behaviour */
-    if (addr & ((1 << a_bits) - 1)) {
-        cpu_unaligned_access(env_cpu(env), addr, MMU_DATA_STORE,
-                             mmu_idx, retaddr);
+    if (unlikely(p->flags & TLB_MMIO)) {
+        io_writex(env, p->full, mmu_idx, val, p->addr, ra, MO_UB);
+    } else if (unlikely(p->flags & TLB_DISCARD_WRITE)) {
+        /* nothing */
+    } else {
+        *(uint8_t *)p->haddr = val;
     }
-
-    index = tlb_index(env, mmu_idx, addr);
-    entry = tlb_entry(env, mmu_idx, addr);
-    tlb_addr = tlb_addr_write(entry);
-
-    /* If the TLB entry is for a different page, reload and try again.  */
-    if (!tlb_hit(tlb_addr, addr)) {
-        if (!victim_tlb_hit(env, mmu_idx, index, MMU_DATA_STORE,
-            addr & TARGET_PAGE_MASK)) {
-            tlb_fill(env_cpu(env), addr, size, MMU_DATA_STORE,
-                     mmu_idx, retaddr);
-            index = tlb_index(env, mmu_idx, addr);
-            entry = tlb_entry(env, mmu_idx, addr);
-        }
-        tlb_addr = tlb_addr_write(entry) & ~TLB_INVALID_MASK;
-    }
-
-    /* Handle anything that isn't just a straight memory access.  */
-    if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
-        CPUTLBEntryFull *full;
-        bool need_swap;
-
-        /* For anything that is unaligned, recurse through byte stores.  */
-        if ((addr & (size - 1)) != 0) {
-            goto do_unaligned_access;
-        }
-
-        full = &env_tlb(env)->d[mmu_idx].fulltlb[index];
-
-        /* Handle watchpoints.  */
-        if (unlikely(tlb_addr & TLB_WATCHPOINT)) {
-            /* On watchpoint hit, this will longjmp out.  */
-            cpu_check_watchpoint(env_cpu(env), addr, size,
-                                 full->attrs, BP_MEM_WRITE, retaddr);
-        }
-
-        need_swap = size > 1 && (tlb_addr & TLB_BSWAP);
-
-        /* Handle I/O access.  */
-        if (tlb_addr & TLB_MMIO) {
-            io_writex(env, full, mmu_idx, val, addr, retaddr,
-                      op ^ (need_swap * MO_BSWAP));
-            return;
-        }
-
-        /* Ignore writes to ROM.  */
-        if (unlikely(tlb_addr & TLB_DISCARD_WRITE)) {
-            return;
-        }
-
-        /* Handle clean RAM pages.  */
-        if (tlb_addr & TLB_NOTDIRTY) {
-            notdirty_write(env_cpu(env), addr, size, full, retaddr);
-        }
-
-        haddr = (void *)((uintptr_t)addr + entry->addend);
-
-        /*
-         * Keep these two store_memop separate to ensure that the compiler
-         * is able to fold the entire function to a single instruction.
-         * There is a build-time assert inside to remind you of this.  ;-)
-         */
-        if (unlikely(need_swap)) {
-            store_memop(haddr, val, op ^ MO_BSWAP);
-        } else {
-            store_memop(haddr, val, op);
-        }
-        return;
-    }
-
-    /* Handle slow unaligned access (it spans two pages or IO).  */
-    if (size > 1
-        && unlikely((addr & ~TARGET_PAGE_MASK) + size - 1
-                     >= TARGET_PAGE_SIZE)) {
-    do_unaligned_access:
-        store_helper_unaligned(env, addr, val, retaddr, size,
-                               mmu_idx, memop_big_endian(op));
-        return;
-    }
-
-    haddr = (void *)((uintptr_t)addr + entry->addend);
-    store_memop(haddr, val, op);
 }
 
-static void __attribute__((noinline))
-full_stb_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
-             MemOpIdx oi, uintptr_t retaddr)
+static void do_st_2(CPUArchState *env, MMULookupPageData *p, uint16_t val,
+                    int mmu_idx, MemOp memop, uintptr_t ra)
 {
-    validate_memop(oi, MO_UB);
-    store_helper(env, addr, val, oi, retaddr, MO_UB);
+    if (unlikely(p->flags & TLB_MMIO)) {
+        io_writex(env, p->full, mmu_idx, val, p->addr, ra, memop);
+    } else if (unlikely(p->flags & TLB_DISCARD_WRITE)) {
+        /* nothing */
+    } else {
+        /* Swap to host endian if necessary, then store. */
+        if (memop & MO_BSWAP) {
+            val = bswap16(val);
+        }
+        store_memop(p->haddr, val, MO_UW);
+    }
+}
+
+static void do_st_4(CPUArchState *env, MMULookupPageData *p, uint32_t val,
+                    int mmu_idx, MemOp memop, uintptr_t ra)
+{
+    if (unlikely(p->flags & TLB_MMIO)) {
+        io_writex(env, p->full, mmu_idx, val, p->addr, ra, memop);
+    } else if (unlikely(p->flags & TLB_DISCARD_WRITE)) {
+        /* nothing */
+    } else {
+        /* Swap to host endian if necessary, then store. */
+        if (memop & MO_BSWAP) {
+            val = bswap32(val);
+        }
+        store_memop(p->haddr, val, MO_UL);
+    }
+}
+
+static void do_st_8(CPUArchState *env, MMULookupPageData *p, uint64_t val,
+                    int mmu_idx, MemOp memop, uintptr_t ra)
+{
+    if (unlikely(p->flags & TLB_MMIO)) {
+        io_writex(env, p->full, mmu_idx, val, p->addr, ra, memop);
+    } else if (unlikely(p->flags & TLB_DISCARD_WRITE)) {
+        /* nothing */
+    } else {
+        /* Swap to host endian if necessary, then store. */
+        if (memop & MO_BSWAP) {
+            val = bswap64(val);
+        }
+        store_memop(p->haddr, val, MO_UQ);
+    }
 }
 
 void helper_ret_stb_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
-                        MemOpIdx oi, uintptr_t retaddr)
+                        MemOpIdx oi, uintptr_t ra)
 {
-    full_stb_mmu(env, addr, val, oi, retaddr);
+    MMULookupLocals l;
+    bool crosspage;
+
+    validate_memop(oi, MO_UB);
+    crosspage = mmu_lookup(env, addr, oi, ra, MMU_DATA_STORE, &l);
+    tcg_debug_assert(!crosspage);
+
+    do_st_1(env, &l.page[0], val, l.mmu_idx, ra);
 }
 
-static void full_le_stw_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
-                            MemOpIdx oi, uintptr_t retaddr)
+static void do_st2_mmu(CPUArchState *env, target_ulong addr, uint16_t val,
+                       MemOpIdx oi, uintptr_t ra)
 {
-    validate_memop(oi, MO_LEUW);
-    store_helper(env, addr, val, oi, retaddr, MO_LEUW);
+    MMULookupLocals l;
+    bool crosspage;
+    uint8_t a, b;
+
+    crosspage = mmu_lookup(env, addr, oi, ra, MMU_DATA_STORE, &l);
+    if (likely(!crosspage)) {
+        do_st_2(env, &l.page[0], val, l.mmu_idx, l.memop, ra);
+        return;
+    }
+
+    if ((l.memop & MO_BSWAP) == MO_LE) {
+        a = val, b = val >> 8;
+    } else {
+        b = val, a = val >> 8;
+    }
+    do_st_1(env, &l.page[0], a, l.mmu_idx, ra);
+    do_st_1(env, &l.page[1], b, l.mmu_idx, ra);
 }
 
 void helper_le_stw_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
                        MemOpIdx oi, uintptr_t retaddr)
 {
-    full_le_stw_mmu(env, addr, val, oi, retaddr);
-}
-
-static void full_be_stw_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
-                            MemOpIdx oi, uintptr_t retaddr)
-{
-    validate_memop(oi, MO_BEUW);
-    store_helper(env, addr, val, oi, retaddr, MO_BEUW);
+    validate_memop(oi, MO_LEUW);
+    do_st2_mmu(env, addr, val, oi, retaddr);
 }
 
 void helper_be_stw_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
                        MemOpIdx oi, uintptr_t retaddr)
 {
-    full_be_stw_mmu(env, addr, val, oi, retaddr);
+    validate_memop(oi, MO_BEUW);
+    do_st2_mmu(env, addr, val, oi, retaddr);
 }
 
-static void full_le_stl_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
-                            MemOpIdx oi, uintptr_t retaddr)
+static void do_st4_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
+                       MemOpIdx oi, uintptr_t ra)
 {
-    validate_memop(oi, MO_LEUL);
-    store_helper(env, addr, val, oi, retaddr, MO_LEUL);
+    MMULookupLocals l;
+    bool crosspage;
+
+    crosspage = mmu_lookup(env, addr, oi, ra, MMU_DATA_STORE, &l);
+    if (likely(!crosspage)) {
+        do_st_4(env, &l.page[0], val, l.mmu_idx, l.memop, ra);
+        return;
+    }
+
+    /* Swap to little endian for simplicity, then store by bytes. */
+    if ((l.memop & MO_BSWAP) != MO_LE) {
+        val = bswap32(val);
+    }
+    val = do_st_leN(env, &l.page[0], val, l.mmu_idx, ra);
+    (void) do_st_leN(env, &l.page[1], val, l.mmu_idx, ra);
 }
 
 void helper_le_stl_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
                        MemOpIdx oi, uintptr_t retaddr)
 {
-    full_le_stl_mmu(env, addr, val, oi, retaddr);
-}
-
-static void full_be_stl_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
-                            MemOpIdx oi, uintptr_t retaddr)
-{
-    validate_memop(oi, MO_BEUL);
-    store_helper(env, addr, val, oi, retaddr, MO_BEUL);
+    validate_memop(oi, MO_LEUL);
+    do_st4_mmu(env, addr, val, oi, retaddr);
 }
 
 void helper_be_stl_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
                        MemOpIdx oi, uintptr_t retaddr)
 {
-    full_be_stl_mmu(env, addr, val, oi, retaddr);
+    validate_memop(oi, MO_BEUL);
+    do_st4_mmu(env, addr, val, oi, retaddr);
+}
+
+static void do_st8_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
+                       MemOpIdx oi, uintptr_t ra)
+{
+    MMULookupLocals l;
+    bool crosspage;
+
+    crosspage = mmu_lookup(env, addr, oi, ra, MMU_DATA_STORE, &l);
+    if (likely(!crosspage)) {
+        do_st_8(env, &l.page[0], val, l.mmu_idx, l.memop, ra);
+        return;
+    }
+
+    /* Swap to little endian for simplicity, then store by bytes. */
+    if ((l.memop & MO_BSWAP) != MO_LE) {
+        val = bswap64(val);
+    }
+    val = do_st_leN(env, &l.page[0], val, l.mmu_idx, ra);
+    (void) do_st_leN(env, &l.page[1], val, l.mmu_idx, ra);
 }
 
 void helper_le_stq_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
                        MemOpIdx oi, uintptr_t retaddr)
 {
     validate_memop(oi, MO_LEUQ);
-    store_helper(env, addr, val, oi, retaddr, MO_LEUQ);
+    do_st8_mmu(env, addr, val, oi, retaddr);
 }
 
 void helper_be_stq_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
                        MemOpIdx oi, uintptr_t retaddr)
 {
     validate_memop(oi, MO_BEUQ);
-    store_helper(env, addr, val, oi, retaddr, MO_BEUQ);
+    do_st8_mmu(env, addr, val, oi, retaddr);
 }
 
 /*
  * Store Helpers for cpu_ldst.h
  */
 
-typedef void FullStoreHelper(CPUArchState *env, target_ulong addr,
-                             uint64_t val, MemOpIdx oi, uintptr_t retaddr);
-
-static inline void cpu_store_helper(CPUArchState *env, target_ulong addr,
-                                    uint64_t val, MemOpIdx oi, uintptr_t ra,
-                                    FullStoreHelper *full_store)
+static void plugin_store_cb(CPUArchState *env, abi_ptr addr, MemOpIdx oi)
 {
-    full_store(env, addr, val, oi, ra);
     qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_W);
 }
 
 void cpu_stb_mmu(CPUArchState *env, target_ulong addr, uint8_t val,
                  MemOpIdx oi, uintptr_t retaddr)
 {
-    cpu_store_helper(env, addr, val, oi, retaddr, full_stb_mmu);
+    helper_ret_stb_mmu(env, addr, val, oi, retaddr);
+    plugin_store_cb(env, addr, oi);
 }
 
 void cpu_stw_be_mmu(CPUArchState *env, target_ulong addr, uint16_t val,
                     MemOpIdx oi, uintptr_t retaddr)
 {
-    cpu_store_helper(env, addr, val, oi, retaddr, full_be_stw_mmu);
+    helper_be_stw_mmu(env, addr, val, oi, retaddr);
+    plugin_store_cb(env, addr, oi);
 }
 
 void cpu_stl_be_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
                     MemOpIdx oi, uintptr_t retaddr)
 {
-    cpu_store_helper(env, addr, val, oi, retaddr, full_be_stl_mmu);
+    helper_be_stl_mmu(env, addr, val, oi, retaddr);
+    plugin_store_cb(env, addr, oi);
 }
 
 void cpu_stq_be_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
                     MemOpIdx oi, uintptr_t retaddr)
 {
-    cpu_store_helper(env, addr, val, oi, retaddr, helper_be_stq_mmu);
+    helper_be_stq_mmu(env, addr, val, oi, retaddr);
+    plugin_store_cb(env, addr, oi);
 }
 
 void cpu_stw_le_mmu(CPUArchState *env, target_ulong addr, uint16_t val,
                     MemOpIdx oi, uintptr_t retaddr)
 {
-    cpu_store_helper(env, addr, val, oi, retaddr, full_le_stw_mmu);
+    helper_le_stw_mmu(env, addr, val, oi, retaddr);
+    plugin_store_cb(env, addr, oi);
 }
 
 void cpu_stl_le_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
                     MemOpIdx oi, uintptr_t retaddr)
 {
-    cpu_store_helper(env, addr, val, oi, retaddr, full_le_stl_mmu);
+    helper_le_stl_mmu(env, addr, val, oi, retaddr);
+    plugin_store_cb(env, addr, oi);
 }
 
 void cpu_stq_le_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
                     MemOpIdx oi, uintptr_t retaddr)
 {
-    cpu_store_helper(env, addr, val, oi, retaddr, helper_le_stq_mmu);
+    helper_le_stq_mmu(env, addr, val, oi, retaddr);
+    plugin_store_cb(env, addr, oi);
 }
 
 void cpu_st16_be_mmu(CPUArchState *env, abi_ptr addr, Int128 val,
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH v4 06/57] accel/tcg: Honor atomicity of loads
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
                   ` (4 preceding siblings ...)
  2023-05-03  7:06 ` [PATCH v4 05/57] accel/tcg: Reorg system mode store helpers Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-04 17:17   ` Peter Maydell
  2023-05-03  7:06 ` [PATCH v4 07/57] accel/tcg: Honor atomicity of stores Richard Henderson
                   ` (51 subsequent siblings)
  57 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel
  Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x, Alex Bennée

Create ldst_atomicity.c.inc.

Atomicity is not required for user-only code loads, because we've
ensured that the page is read-only before beginning to translate code.
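
As a quick map of the new file, the 4-byte loader reduces to the following
(condensed from load_atom_4() in the diff below; the HAVE_al16_fast shortcut
and the 2- and 8-byte variants are omitted):

    static uint32_t load_atom_4(CPUArchState *env, uintptr_t ra,
                                void *pv, MemOp memop)
    {
        uintptr_t pi = (uintptr_t)pv;

        if (likely((pi & 3) == 0)) {
            /* Aligned: a single atomic 4-byte load. */
            return load_atomic4(pv);
        }

        switch (required_atomicity(env, pi, memop)) {
        case MO_8:
        case MO_16:
        case -MO_16:
            /* 2-byte or smaller atomicity suffices: two aligned 4-byte loads. */
            return load_atom_extract_al4x2(pv);
        case MO_32:
            /* All 4 bytes must be atomic: extract from an 8- or 16-byte load. */
            return (pi & 4) ? load_atom_extract_al16_or_exit(env, ra, pv, 4)
                            : load_atom_extract_al8_or_exit(env, ra, pv, 4);
        default:
            g_assert_not_reached();
        }
    }

required_atomicity() translates the MO_ATOM_* and MO_ATMAX_* bits, together
with the actual address, into the lg2 size that must be loaded indivisibly;
a negative value means the access must be split in two and each half
examined separately for atomicity.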

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 accel/tcg/cputlb.c             | 170 +++++++---
 accel/tcg/user-exec.c          |  26 +-
 accel/tcg/ldst_atomicity.c.inc | 550 +++++++++++++++++++++++++++++++++
 3 files changed, 695 insertions(+), 51 deletions(-)
 create mode 100644 accel/tcg/ldst_atomicity.c.inc

diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index f52c7e6da0..6f3a419fe8 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -1668,6 +1668,9 @@ tb_page_addr_t get_page_addr_code_hostp(CPUArchState *env, target_ulong addr,
     return qemu_ram_addr_from_host_nofail(p);
 }
 
+/* Load/store with atomicity primitives. */
+#include "ldst_atomicity.c.inc"
+
 #ifdef CONFIG_PLUGIN
 /*
  * Perform a TLB lookup and populate the qemu_plugin_hwaddr structure.
@@ -2034,35 +2037,7 @@ static void validate_memop(MemOpIdx oi, MemOp expected)
  * specifically for reading instructions from system memory. It is
  * called by the translation loop and in some helpers where the code
  * is disassembled. It shouldn't be called directly by guest code.
- */
-
-typedef uint64_t FullLoadHelper(CPUArchState *env, target_ulong addr,
-                                MemOpIdx oi, uintptr_t retaddr);
-
-static inline uint64_t QEMU_ALWAYS_INLINE
-load_memop(const void *haddr, MemOp op)
-{
-    switch (op) {
-    case MO_UB:
-        return ldub_p(haddr);
-    case MO_BEUW:
-        return lduw_be_p(haddr);
-    case MO_LEUW:
-        return lduw_le_p(haddr);
-    case MO_BEUL:
-        return (uint32_t)ldl_be_p(haddr);
-    case MO_LEUL:
-        return (uint32_t)ldl_le_p(haddr);
-    case MO_BEUQ:
-        return ldq_be_p(haddr);
-    case MO_LEUQ:
-        return ldq_le_p(haddr);
-    default:
-        qemu_build_not_reached();
-    }
-}
-
-/*
+ *
  * For the benefit of TCG generated code, we want to avoid the
  * complication of ABI-specific return type promotion and always
  * return a value extended to the register size of the host. This is
@@ -2118,17 +2093,134 @@ static uint64_t do_ld_bytes_beN(MMULookupPageData *p, uint64_t ret_be)
     return ret_be;
 }
 
+/**
+ * do_ld_parts_beN
+ * @p: translation parameters
+ * @ret_be: accumulated data
+ *
+ * As do_ld_bytes_beN, but atomically on each aligned part.
+ */
+static uint64_t do_ld_parts_beN(MMULookupPageData *p, uint64_t ret_be)
+{
+    void *haddr = p->haddr;
+    int size = p->size;
+
+    do {
+        uint64_t x;
+        int n;
+
+        /*
+         * Find minimum of alignment and size.
+         * This is slightly stronger than required by MO_ATOM_SUBALIGN, which
+         * would have only checked the low bits of addr|size once at the start,
+         * but is just as easy.
+         */
+        switch (((uintptr_t)haddr | size) & 7) {
+        case 4:
+            x = cpu_to_be32(load_atomic4(haddr));
+            ret_be = (ret_be << 32) | x;
+            n = 4;
+            break;
+        case 2:
+        case 6:
+            x = cpu_to_be16(load_atomic2(haddr));
+            ret_be = (ret_be << 16) | x;
+            n = 2;
+            break;
+        default:
+            x = *(uint8_t *)haddr;
+            ret_be = (ret_be << 8) | x;
+            n = 1;
+            break;
+        case 0:
+            g_assert_not_reached();
+        }
+        haddr += n;
+        size -= n;
+    } while (size != 0);
+    return ret_be;
+}
+
+/**
+ * do_ld_whole_be4
+ * @p: translation parameters
+ * @ret_be: accumulated data
+ *
+ * As do_ld_bytes_beN, but with one atomic load.
+ * Four aligned bytes are guaranteed to cover the load.
+ */
+static uint64_t do_ld_whole_be4(MMULookupPageData *p, uint64_t ret_be)
+{
+    int o = p->addr & 3;
+    uint32_t x = load_atomic4(p->haddr - o);
+
+    x = cpu_to_be32(x);
+    x <<= o * 8;
+    x >>= (4 - p->size) * 8;
+    return (ret_be << (p->size * 8)) | x;
+}
+
+/**
+ * do_ld_whole_be8
+ * @p: translation parameters
+ * @ret_be: accumulated data
+ *
+ * As do_ld_bytes_beN, but with one atomic load.
+ * Eight aligned bytes are guaranteed to cover the load.
+ */
+static uint64_t do_ld_whole_be8(CPUArchState *env, uintptr_t ra,
+                                MMULookupPageData *p, uint64_t ret_be)
+{
+    int o = p->addr & 7;
+    uint64_t x = load_atomic8_or_exit(env, ra, p->haddr - o);
+
+    x = cpu_to_be64(x);
+    x <<= o * 8;
+    x >>= (8 - p->size) * 8;
+    return (ret_be << (p->size * 8)) | x;
+}
+
 /*
  * Wrapper for the above.
  */
 static uint64_t do_ld_beN(CPUArchState *env, MMULookupPageData *p,
-                          uint64_t ret_be, int mmu_idx,
-                          MMUAccessType type, uintptr_t ra)
+                          uint64_t ret_be, int mmu_idx, MMUAccessType type,
+                          MemOp mop, uintptr_t ra)
 {
+    MemOp atmax;
+
     if (unlikely(p->flags & TLB_MMIO)) {
         return do_ld_mmio_beN(env, p, ret_be, mmu_idx, type, ra);
-    } else {
+    }
+
+    switch (mop & MO_ATOM_MASK) {
+    case MO_ATOM_WITHIN16:
+        /*
+         * It is a given that we cross a page and therefore there is no
+         * atomicity for the load as a whole, but there may be a subobject
+         * as defined by ATMAX which does not cross a 16-byte boundary.
+         */
+        atmax = mop & MO_ATMAX_MASK;
+        if (atmax == MO_ATMAX_SIZE) {
+            atmax = mop & MO_SIZE;
+        } else {
+            atmax >>= MO_ATMAX_SHIFT;
+        }
+        if (unlikely(p->size >= (1 << atmax))) {
+            if (!HAVE_al8_fast && p->size < 4) {
+                return do_ld_whole_be4(p, ret_be);
+            } else {
+                return do_ld_whole_be8(env, ra, p, ret_be);
+            }
+        }
+        /* fall through */
+    case MO_ATOM_IFALIGN:
+    case MO_ATOM_NONE:
         return do_ld_bytes_beN(p, ret_be);
+    case MO_ATOM_SUBALIGN:
+        return do_ld_parts_beN(p, ret_be);
+    default:
+        g_assert_not_reached();
     }
 }
 
@@ -2152,7 +2244,7 @@ static uint16_t do_ld_2(CPUArchState *env, MMULookupPageData *p, int mmu_idx,
     }
 
     /* Perform the load host endian, then swap if necessary. */
-    ret = load_memop(p->haddr, MO_UW);
+    ret = load_atom_2(env, ra, p->haddr, memop);
     if (memop & MO_BSWAP) {
         ret = bswap16(ret);
     }
@@ -2169,7 +2261,7 @@ static uint32_t do_ld_4(CPUArchState *env, MMULookupPageData *p, int mmu_idx,
     }
 
     /* Perform the load host endian. */
-    ret = load_memop(p->haddr, MO_UL);
+    ret = load_atom_4(env, ra, p->haddr, memop);
     if (memop & MO_BSWAP) {
         ret = bswap32(ret);
     }
@@ -2186,7 +2278,7 @@ static uint64_t do_ld_8(CPUArchState *env, MMULookupPageData *p, int mmu_idx,
     }
 
     /* Perform the load host endian. */
-    ret = load_memop(p->haddr, MO_UQ);
+    ret = load_atom_8(env, ra, p->haddr, memop);
     if (memop & MO_BSWAP) {
         ret = bswap64(ret);
     }
@@ -2262,8 +2354,8 @@ static uint32_t do_ld4_mmu(CPUArchState *env, target_ulong addr, MemOpIdx oi,
         return do_ld_4(env, &l.page[0], l.mmu_idx, access_type, l.memop, ra);
     }
 
-    ret = do_ld_beN(env, &l.page[0], 0, l.mmu_idx, access_type, ra);
-    ret = do_ld_beN(env, &l.page[1], ret, l.mmu_idx, access_type, ra);
+    ret = do_ld_beN(env, &l.page[0], 0, l.mmu_idx, access_type, l.memop, ra);
+    ret = do_ld_beN(env, &l.page[1], ret, l.mmu_idx, access_type, l.memop, ra);
     if ((l.memop & MO_BSWAP) == MO_LE) {
         ret = bswap32(ret);
     }
@@ -2296,8 +2388,8 @@ static uint64_t do_ld8_mmu(CPUArchState *env, target_ulong addr, MemOpIdx oi,
         return do_ld_8(env, &l.page[0], l.mmu_idx, access_type, l.memop, ra);
     }
 
-    ret = do_ld_beN(env, &l.page[0], 0, l.mmu_idx, access_type, ra);
-    ret = do_ld_beN(env, &l.page[1], ret, l.mmu_idx, access_type, ra);
+    ret = do_ld_beN(env, &l.page[0], 0, l.mmu_idx, access_type, l.memop, ra);
+    ret = do_ld_beN(env, &l.page[1], ret, l.mmu_idx, access_type, l.memop, ra);
     if ((l.memop & MO_BSWAP) == MO_LE) {
         ret = bswap64(ret);
     }
diff --git a/accel/tcg/user-exec.c b/accel/tcg/user-exec.c
index fc597a010d..fefc83cc8c 100644
--- a/accel/tcg/user-exec.c
+++ b/accel/tcg/user-exec.c
@@ -931,6 +931,8 @@ static void *cpu_mmu_lookup(CPUArchState *env, target_ulong addr,
     return ret;
 }
 
+#include "ldst_atomicity.c.inc"
+
 uint8_t cpu_ldb_mmu(CPUArchState *env, abi_ptr addr,
                     MemOpIdx oi, uintptr_t ra)
 {
@@ -953,10 +955,10 @@ uint16_t cpu_ldw_be_mmu(CPUArchState *env, abi_ptr addr,
 
     validate_memop(oi, MO_BEUW);
     haddr = cpu_mmu_lookup(env, addr, oi, ra, MMU_DATA_LOAD);
-    ret = lduw_be_p(haddr);
+    ret = load_atom_2(env, ra, haddr, get_memop(oi));
     clear_helper_retaddr();
     qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_R);
-    return ret;
+    return cpu_to_be16(ret);
 }
 
 uint32_t cpu_ldl_be_mmu(CPUArchState *env, abi_ptr addr,
@@ -967,10 +969,10 @@ uint32_t cpu_ldl_be_mmu(CPUArchState *env, abi_ptr addr,
 
     validate_memop(oi, MO_BEUL);
     haddr = cpu_mmu_lookup(env, addr, oi, ra, MMU_DATA_LOAD);
-    ret = ldl_be_p(haddr);
+    ret = load_atom_4(env, ra, haddr, get_memop(oi));
     clear_helper_retaddr();
     qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_R);
-    return ret;
+    return cpu_to_be32(ret);
 }
 
 uint64_t cpu_ldq_be_mmu(CPUArchState *env, abi_ptr addr,
@@ -981,10 +983,10 @@ uint64_t cpu_ldq_be_mmu(CPUArchState *env, abi_ptr addr,
 
     validate_memop(oi, MO_BEUQ);
     haddr = cpu_mmu_lookup(env, addr, oi, ra, MMU_DATA_LOAD);
-    ret = ldq_be_p(haddr);
+    ret = load_atom_8(env, ra, haddr, get_memop(oi));
     clear_helper_retaddr();
     qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_R);
-    return ret;
+    return cpu_to_be64(ret);
 }
 
 uint16_t cpu_ldw_le_mmu(CPUArchState *env, abi_ptr addr,
@@ -995,10 +997,10 @@ uint16_t cpu_ldw_le_mmu(CPUArchState *env, abi_ptr addr,
 
     validate_memop(oi, MO_LEUW);
     haddr = cpu_mmu_lookup(env, addr, oi, ra, MMU_DATA_LOAD);
-    ret = lduw_le_p(haddr);
+    ret = load_atom_2(env, ra, haddr, get_memop(oi));
     clear_helper_retaddr();
     qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_R);
-    return ret;
+    return cpu_to_le16(ret);
 }
 
 uint32_t cpu_ldl_le_mmu(CPUArchState *env, abi_ptr addr,
@@ -1009,10 +1011,10 @@ uint32_t cpu_ldl_le_mmu(CPUArchState *env, abi_ptr addr,
 
     validate_memop(oi, MO_LEUL);
     haddr = cpu_mmu_lookup(env, addr, oi, ra, MMU_DATA_LOAD);
-    ret = ldl_le_p(haddr);
+    ret = load_atom_4(env, ra, haddr, get_memop(oi));
     clear_helper_retaddr();
     qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_R);
-    return ret;
+    return cpu_to_le32(ret);
 }
 
 uint64_t cpu_ldq_le_mmu(CPUArchState *env, abi_ptr addr,
@@ -1023,10 +1025,10 @@ uint64_t cpu_ldq_le_mmu(CPUArchState *env, abi_ptr addr,
 
     validate_memop(oi, MO_LEUQ);
     haddr = cpu_mmu_lookup(env, addr, oi, ra, MMU_DATA_LOAD);
-    ret = ldq_le_p(haddr);
+    ret = load_atom_8(env, ra, haddr, get_memop(oi));
     clear_helper_retaddr();
     qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_R);
-    return ret;
+    return cpu_to_le64(ret);
 }
 
 Int128 cpu_ld16_be_mmu(CPUArchState *env, abi_ptr addr,
diff --git a/accel/tcg/ldst_atomicity.c.inc b/accel/tcg/ldst_atomicity.c.inc
new file mode 100644
index 0000000000..5169073431
--- /dev/null
+++ b/accel/tcg/ldst_atomicity.c.inc
@@ -0,0 +1,550 @@
+/*
+ * Routines common to user and system emulation of load/store.
+ *
+ *  Copyright (c) 2022 Linaro, Ltd.
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ */
+
+#ifdef CONFIG_ATOMIC64
+# define HAVE_al8          true
+#else
+# define HAVE_al8          false
+#endif
+#define HAVE_al8_fast      (ATOMIC_REG_SIZE >= 8)
+
+#if defined(CONFIG_ATOMIC128)
+# define HAVE_al16_fast    true
+#else
+# define HAVE_al16_fast    false
+#endif
+
+/**
+ * required_atomicity:
+ *
+ * Return the lg2 bytes of atomicity required by @memop for @p.
+ * If the operation must be split into two operations to be
+ * examined separately for atomicity, return -lg2.
+ */
+static int required_atomicity(CPUArchState *env, uintptr_t p, MemOp memop)
+{
+    int atmax = memop & MO_ATMAX_MASK;
+    int size = memop & MO_SIZE;
+    unsigned tmp;
+
+    if (atmax == MO_ATMAX_SIZE) {
+        atmax = size;
+    } else {
+        atmax >>= MO_ATMAX_SHIFT;
+    }
+
+    switch (memop & MO_ATOM_MASK) {
+    case MO_ATOM_IFALIGN:
+        tmp = (1 << atmax) - 1;
+        if (p & tmp) {
+            return MO_8;
+        }
+        break;
+    case MO_ATOM_NONE:
+        return MO_8;
+    case MO_ATOM_SUBALIGN:
+        tmp = p & -p;
+        if (tmp != 0 && tmp < atmax) {
+            atmax = tmp;
+        }
+        break;
+    case MO_ATOM_WITHIN16:
+        tmp = p & 15;
+        if (tmp + (1 << size) <= 16) {
+            atmax = size;
+        } else if (atmax == size) {
+            return MO_8;
+        } else if (tmp + (1 << atmax) != 16) {
+            /*
+             * Paired load/store, where the pairs aren't aligned.
+             * One of the two must still be handled atomically.
+             */
+            atmax = -atmax;
+        }
+        break;
+    default:
+        g_assert_not_reached();
+    }
+
+    /*
+     * Here we have the architectural atomicity of the operation.
+     * However, when executing in a serial context, we need no extra
+     * host atomicity in order to avoid racing.  This reduction
+     * avoids looping with cpu_loop_exit_atomic.
+     */
+    if (cpu_in_serial_context(env_cpu(env))) {
+        return MO_8;
+    }
+    return atmax;
+}
+
+/**
+ * load_atomic2:
+ * @pv: host address
+ *
+ * Atomically load 2 aligned bytes from @pv.
+ */
+static inline uint16_t load_atomic2(void *pv)
+{
+    uint16_t *p = __builtin_assume_aligned(pv, 2);
+    return qatomic_read(p);
+}
+
+/**
+ * load_atomic4:
+ * @pv: host address
+ *
+ * Atomically load 4 aligned bytes from @pv.
+ */
+static inline uint32_t load_atomic4(void *pv)
+{
+    uint32_t *p = __builtin_assume_aligned(pv, 4);
+    return qatomic_read(p);
+}
+
+/**
+ * load_atomic8:
+ * @pv: host address
+ *
+ * Atomically load 8 aligned bytes from @pv.
+ */
+static inline uint64_t load_atomic8(void *pv)
+{
+    uint64_t *p = __builtin_assume_aligned(pv, 8);
+
+    qemu_build_assert(HAVE_al8);
+    return qatomic_read__nocheck(p);
+}
+
+/**
+ * load_atomic16:
+ * @pv: host address
+ *
+ * Atomically load 16 aligned bytes from @pv.
+ */
+static inline Int128 load_atomic16(void *pv)
+{
+#ifdef CONFIG_ATOMIC128
+    __uint128_t *p = __builtin_assume_aligned(pv, 16);
+    Int128Alias r;
+
+    r.u = qatomic_read__nocheck(p);
+    return r.s;
+#else
+    qemu_build_not_reached();
+#endif
+}
+
+/**
+ * load_atomic8_or_exit:
+ * @env: cpu context
+ * @ra: host unwind address
+ * @pv: host address
+ *
+ * Atomically load 8 aligned bytes from @pv.
+ * If this is not possible, longjmp out to restart serially.
+ */
+static uint64_t load_atomic8_or_exit(CPUArchState *env, uintptr_t ra, void *pv)
+{
+    if (HAVE_al8) {
+        return load_atomic8(pv);
+    }
+
+#ifdef CONFIG_USER_ONLY
+    /*
+     * If the page is not writable, then assume the value is immutable
+     * and requires no locking.  This ignores the case of MAP_SHARED with
+     * another process, because the fallback start_exclusive solution
+     * provides no protection across processes.
+     */
+    if (!page_check_range(h2g(pv), 8, PAGE_WRITE)) {
+        uint64_t *p = __builtin_assume_aligned(pv, 8);
+        return *p;
+    }
+#endif
+
+    /* Ultimate fallback: re-execute in serial context. */
+    cpu_loop_exit_atomic(env_cpu(env), ra);
+}
+
+/**
+ * load_atomic16_or_exit:
+ * @env: cpu context
+ * @ra: host unwind address
+ * @pv: host address
+ *
+ * Atomically load 16 aligned bytes from @pv.
+ * If this is not possible, longjmp out to restart serially.
+ */
+static Int128 load_atomic16_or_exit(CPUArchState *env, uintptr_t ra, void *pv)
+{
+    Int128 *p = __builtin_assume_aligned(pv, 16);
+
+    if (HAVE_al16_fast) {
+        return load_atomic16(p);
+    }
+
+#ifdef CONFIG_USER_ONLY
+    /*
+     * We can only use cmpxchg to emulate a load if the page is writable.
+     * If the page is not writable, then assume the value is immutable
+     * and requires no locking.  This ignores the case of MAP_SHARED with
+     * another process, because the fallback start_exclusive solution
+     * provides no protection across processes.
+     */
+    if (!page_check_range(h2g(p), 16, PAGE_WRITE)) {
+        return *p;
+    }
+#endif
+
+    /*
+     * In system mode all guest pages are writable, and for user-only
+     * we have just checked writability.  Try cmpxchg.
+     */
+#if defined(CONFIG_CMPXCHG128)
+    /* Swap 0 with 0, with the side-effect of returning the old value. */
+    {
+        Int128Alias r;
+        r.u = __sync_val_compare_and_swap_16((__uint128_t *)p, 0, 0);
+        return r.s;
+    }
+#endif
+
+    /* Ultimate fallback: re-execute in serial context. */
+    cpu_loop_exit_atomic(env_cpu(env), ra);
+}
+
+/**
+ * load_atom_extract_al4x2:
+ * @pv: host address
+ *
+ * Load 4 bytes from @p, from two sequential atomic 4-byte loads.
+ */
+static uint32_t load_atom_extract_al4x2(void *pv)
+{
+    uintptr_t pi = (uintptr_t)pv;
+    int sh = (pi & 3) * 8;
+    uint32_t a, b;
+
+    pv = (void *)(pi & ~3);
+    a = load_atomic4(pv);
+    b = load_atomic4(pv + 4);
+
+    if (HOST_BIG_ENDIAN) {
+        return (a << sh) | (b >> (-sh & 31));
+    } else {
+        return (a >> sh) | (b << (-sh & 31));
+    }
+}
+
+/**
+ * load_atom_extract_al8x2:
+ * @pv: host address
+ *
+ * Load 8 bytes from @p, from two sequential atomic 8-byte loads.
+ */
+static uint64_t load_atom_extract_al8x2(void *pv)
+{
+    uintptr_t pi = (uintptr_t)pv;
+    int sh = (pi & 7) * 8;
+    uint64_t a, b;
+
+    pv = (void *)(pi & ~7);
+    a = load_atomic8(pv);
+    b = load_atomic8(pv + 8);
+
+    if (HOST_BIG_ENDIAN) {
+        return (a << sh) | (b >> (-sh & 63));
+    } else {
+        return (a >> sh) | (b << (-sh & 63));
+    }
+}
+
+/**
+ * load_atom_extract_al8_or_exit:
+ * @env: cpu context
+ * @ra: host unwind address
+ * @pv: host address
+ * @s: object size in bytes, @s <= 4.
+ *
+ * Atomically load @s bytes from @p, when p % s != 0, and [p, p+s-1] does
+ * not cross an 8-byte boundary.  This means that we can perform an atomic
+ * 8-byte load and extract.
+ * The value is returned in the low bits of a uint32_t.
+ */
+static uint32_t load_atom_extract_al8_or_exit(CPUArchState *env, uintptr_t ra,
+                                              void *pv, int s)
+{
+    uintptr_t pi = (uintptr_t)pv;
+    int o = pi & 7;
+    int shr = (HOST_BIG_ENDIAN ? 8 - s - o : o) * 8;
+
+    pv = (void *)(pi & ~7);
+    return load_atomic8_or_exit(env, ra, pv) >> shr;
+}
+
+/**
+ * load_atom_extract_al16_or_exit:
+ * @env: cpu context
+ * @ra: host unwind address
+ * @p: host address
+ * @s: object size in bytes, @s <= 8.
+ *
+ * Atomically load @s bytes from @p, when p % 16 < 8
+ * and p % 16 + s > 8.  I.e. does not cross a 16-byte
+ * boundary, but *does* cross an 8-byte boundary.
+ * This is the slow version, so we must have eliminated
+ * any faster load_atom_extract_al8_or_exit case.
+ *
+ * If this is not possible, longjmp out to restart serially.
+ */
+static uint64_t load_atom_extract_al16_or_exit(CPUArchState *env, uintptr_t ra,
+                                               void *pv, int s)
+{
+    uintptr_t pi = (uintptr_t)pv;
+    int o = pi & 7;
+    int shr = (HOST_BIG_ENDIAN ? 16 - s - o : o) * 8;
+    Int128 r;
+
+    /*
+     * Note constraints above: p & 8 must be clear.
+     * Provoke SIGBUS if possible otherwise.
+     */
+    pv = (void *)(pi & ~7);
+    r = load_atomic16_or_exit(env, ra, pv);
+
+    r = int128_urshift(r, shr);
+    return int128_getlo(r);
+}
+
+/**
+ * load_atom_extract_al16_or_al8:
+ * @p: host address
+ * @s: object size in bytes, @s <= 8.
+ *
+ * Load @s bytes from @p, when p % s != 0.  If [p, p+s-1] does not
+ * cross a 16-byte boundary then the access must be 16-byte atomic,
+ * otherwise the access must be 8-byte atomic.
+ */
+static inline uint64_t load_atom_extract_al16_or_al8(void *pv, int s)
+{
+#if defined(CONFIG_ATOMIC128)
+    uintptr_t pi = (uintptr_t)pv;
+    int o = pi & 7;
+    int shr = (HOST_BIG_ENDIAN ? 16 - s - o : o) * 8;
+    __uint128_t r;
+
+    pv = (void *)(pi & ~7);
+    if (pi & 8) {
+        uint64_t *p8 = __builtin_assume_aligned(pv, 16, 8);
+        uint64_t a = qatomic_read__nocheck(p8);
+        uint64_t b = qatomic_read__nocheck(p8 + 1);
+
+        if (HOST_BIG_ENDIAN) {
+            r = ((__uint128_t)a << 64) | b;
+        } else {
+            r = ((__uint128_t)b << 64) | a;
+        }
+    } else {
+        __uint128_t *p16 = __builtin_assume_aligned(pv, 16, 0);
+        r = qatomic_read__nocheck(p16);
+    }
+    return r >> shr;
+#else
+    qemu_build_not_reached();
+#endif
+}
+
+/**
+ * load_atom_4_by_2:
+ * @pv: host address
+ *
+ * Load 4 bytes from @pv, with two 2-byte atomic loads.
+ */
+static inline uint32_t load_atom_4_by_2(void *pv)
+{
+    uint32_t a = load_atomic2(pv);
+    uint32_t b = load_atomic2(pv + 2);
+
+    if (HOST_BIG_ENDIAN) {
+        return (a << 16) | b;
+    } else {
+        return (b << 16) | a;
+    }
+}
+
+/**
+ * load_atom_8_by_2:
+ * @pv: host address
+ *
+ * Load 8 bytes from @pv, with four 2-byte atomic loads.
+ */
+static inline uint64_t load_atom_8_by_2(void *pv)
+{
+    uint32_t a = load_atom_4_by_2(pv);
+    uint32_t b = load_atom_4_by_2(pv + 4);
+
+    if (HOST_BIG_ENDIAN) {
+        return ((uint64_t)a << 32) | b;
+    } else {
+        return ((uint64_t)b << 32) | a;
+    }
+}
+
+/**
+ * load_atom_8_by_4:
+ * @pv: host address
+ *
+ * Load 8 bytes from @pv, with two 4-byte atomic loads.
+ */
+static inline uint64_t load_atom_8_by_4(void *pv)
+{
+    uint32_t a = load_atomic4(pv);
+    uint32_t b = load_atomic4(pv + 4);
+
+    if (HOST_BIG_ENDIAN) {
+        return ((uint64_t)a << 32) | b;
+    } else {
+        return ((uint64_t)b << 32) | a;
+    }
+}
+
+/**
+ * load_atom_2:
+ * @p: host address
+ * @memop: the full memory op
+ *
+ * Load 2 bytes from @p, honoring the atomicity of @memop.
+ */
+static uint16_t load_atom_2(CPUArchState *env, uintptr_t ra,
+                            void *pv, MemOp memop)
+{
+    uintptr_t pi = (uintptr_t)pv;
+    int atmax;
+
+    if (likely((pi & 1) == 0)) {
+        return load_atomic2(pv);
+    }
+    if (HAVE_al16_fast) {
+        return load_atom_extract_al16_or_al8(pv, 2);
+    }
+
+    atmax = required_atomicity(env, pi, memop);
+    switch (atmax) {
+    case MO_8:
+        return lduw_he_p(pv);
+    case MO_16:
+        /* The only case remaining is MO_ATOM_WITHIN16. */
+        if (!HAVE_al8_fast && (pi & 3) == 1) {
+            /* Big or little endian, we want the middle two bytes. */
+            return load_atomic4(pv - 1) >> 8;
+        }
+        if (unlikely((pi & 15) != 7)) {
+            return load_atom_extract_al8_or_exit(env, ra, pv, 2);
+        }
+        return load_atom_extract_al16_or_exit(env, ra, pv, 2);
+    default:
+        g_assert_not_reached();
+    }
+}
+
+/**
+ * load_atom_4:
+ * @p: host address
+ * @memop: the full memory op
+ *
+ * Load 4 bytes from @p, honoring the atomicity of @memop.
+ */
+static uint32_t load_atom_4(CPUArchState *env, uintptr_t ra,
+                            void *pv, MemOp memop)
+{
+    uintptr_t pi = (uintptr_t)pv;
+    int atmax;
+
+    if (likely((pi & 3) == 0)) {
+        return load_atomic4(pv);
+    }
+    if (HAVE_al16_fast) {
+        return load_atom_extract_al16_or_al8(pv, 4);
+    }
+
+    atmax = required_atomicity(env, pi, memop);
+    switch (atmax) {
+    case MO_8:
+    case MO_16:
+    case -MO_16:
+        /*
+         * For MO_ATOM_IFALIGN, this is more atomicity than required,
+         * but it's trivially supported on all hosts, better than 4
+         * individual byte loads (when the host requires alignment),
+         * and overlaps with the MO_ATOM_SUBALIGN case of p % 2 == 0.
+         */
+        return load_atom_extract_al4x2(pv);
+    case MO_32:
+        if (!(pi & 4)) {
+            return load_atom_extract_al8_or_exit(env, ra, pv, 4);
+        }
+        return load_atom_extract_al16_or_exit(env, ra, pv, 4);
+    default:
+        g_assert_not_reached();
+    }
+}
+
+/**
+ * load_atom_8:
+ * @p: host address
+ * @memop: the full memory op
+ *
+ * Load 8 bytes from @p, honoring the atomicity of @memop.
+ */
+static uint64_t load_atom_8(CPUArchState *env, uintptr_t ra,
+                            void *pv, MemOp memop)
+{
+    uintptr_t pi = (uintptr_t)pv;
+    int atmax;
+
+    /*
+     * If the host does not support 8-byte atomics, wait until we have
+     * examined the atomicity parameters below.
+     */
+    if (HAVE_al8 && likely((pi & 7) == 0)) {
+        return load_atomic8(pv);
+    }
+    if (HAVE_al16_fast) {
+        return load_atom_extract_al16_or_al8(pv, 8);
+    }
+
+    atmax = required_atomicity(env, pi, memop);
+    if (atmax == MO_64) {
+        if (!HAVE_al8 && (pi & 7) == 0) {
+            return load_atomic8_or_exit(env, ra, pv);
+        }
+        return load_atom_extract_al16_or_exit(env, ra, pv, 8);
+    }
+    if (HAVE_al8_fast) {
+        return load_atom_extract_al8x2(pv);
+    }
+    switch (atmax) {
+    case MO_8:
+        return ldq_he_p(pv);
+    case MO_16:
+        return load_atom_8_by_2(pv);
+    case MO_32:
+        return load_atom_8_by_4(pv);
+    case -MO_32:
+        if (HAVE_al8) {
+            return load_atom_extract_al8x2(pv);
+        }
+        cpu_loop_exit_atomic(env_cpu(env), ra);
+    default:
+        g_assert_not_reached();
+    }
+}
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH v4 07/57] accel/tcg: Honor atomicity of stores
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
                   ` (5 preceding siblings ...)
  2023-05-03  7:06 ` [PATCH v4 06/57] accel/tcg: Honor atomicity of loads Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-05  9:28   ` Peter Maydell
  2023-05-03  7:06 ` [PATCH v4 08/57] target/loongarch: Do not include tcg-ldst.h Richard Henderson
                   ` (50 subsequent siblings)
  57 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel; +Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 accel/tcg/cputlb.c             | 103 +++----
 accel/tcg/user-exec.c          |  12 +-
 accel/tcg/ldst_atomicity.c.inc | 491 +++++++++++++++++++++++++++++++++
 3 files changed, 540 insertions(+), 66 deletions(-)

diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index 6f3a419fe8..6b8b472a11 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -2593,36 +2593,6 @@ Int128 cpu_ld16_le_mmu(CPUArchState *env, abi_ptr addr,
  * Store Helpers
  */
 
-static inline void QEMU_ALWAYS_INLINE
-store_memop(void *haddr, uint64_t val, MemOp op)
-{
-    switch (op) {
-    case MO_UB:
-        stb_p(haddr, val);
-        break;
-    case MO_BEUW:
-        stw_be_p(haddr, val);
-        break;
-    case MO_LEUW:
-        stw_le_p(haddr, val);
-        break;
-    case MO_BEUL:
-        stl_be_p(haddr, val);
-        break;
-    case MO_LEUL:
-        stl_le_p(haddr, val);
-        break;
-    case MO_BEUQ:
-        stq_be_p(haddr, val);
-        break;
-    case MO_LEUQ:
-        stq_le_p(haddr, val);
-        break;
-    default:
-        qemu_build_not_reached();
-    }
-}
-
 /**
  * do_st_mmio_leN:
  * @env: cpu context
@@ -2649,38 +2619,51 @@ static uint64_t do_st_mmio_leN(CPUArchState *env, MMULookupPageData *p,
     return val_le;
 }
 
-/**
- * do_st_bytes_leN:
- * @p: translation parameters
- * @val_le: data to store
- *
- * Store @p->size bytes at @p->haddr, which is RAM.
- * The bytes to store are extracted in little-endian order from @val_le;
- * return the bytes of @val_le beyond @p->size that have not been stored.
- */
-static uint64_t do_st_bytes_leN(MMULookupPageData *p, uint64_t val_le)
-{
-    uint8_t *haddr = p->haddr;
-    int i, size = p->size;
-
-    for (i = 0; i < size; i++, val_le >>= 8) {
-        haddr[i] = val_le;
-    }
-    return val_le;
-}
-
 /*
  * Wrapper for the above.
  */
 static uint64_t do_st_leN(CPUArchState *env, MMULookupPageData *p,
-                          uint64_t val_le, int mmu_idx, uintptr_t ra)
+                          uint64_t val_le, int mmu_idx,
+                          MemOp mop, uintptr_t ra)
 {
+    MemOp atmax;
+
     if (unlikely(p->flags & TLB_MMIO)) {
         return do_st_mmio_leN(env, p, val_le, mmu_idx, ra);
     } else if (unlikely(p->flags & TLB_DISCARD_WRITE)) {
         return val_le >> (p->size * 8);
-    } else {
-        return do_st_bytes_leN(p, val_le);
+    }
+
+    switch (mop & MO_ATOM_MASK) {
+    case MO_ATOM_WITHIN16:
+        /*
+         * It is a given that we cross a page and therefore there is no
+         * atomicity for the store as a whole, but there may be a subobject
+         * as defined by ATMAX which does not cross a 16-byte boundary.
+         */
+        atmax = mop & MO_ATMAX_MASK;
+        if (atmax == MO_ATMAX_SIZE) {
+            atmax = mop & MO_SIZE;
+        } else {
+            atmax >>= MO_ATMAX_SHIFT;
+        }
+        if (unlikely(p->size >= (1 << atmax))) {
+            if (!HAVE_al8_fast && p->size <= 4) {
+                return store_whole_le4(p->haddr, p->size, val_le);
+            } else if (HAVE_al8) {
+                return store_whole_le8(p->haddr, p->size, val_le);
+            } else {
+                cpu_loop_exit_atomic(env_cpu(env), ra);
+            }
+        }
+        /* fall through */
+    case MO_ATOM_IFALIGN:
+    case MO_ATOM_NONE:
+        return store_bytes_leN(p->haddr, p->size, val_le);
+    case MO_ATOM_SUBALIGN:
+        return store_parts_leN(p->haddr, p->size, val_le);
+    default:
+        g_assert_not_reached();
     }
 }
 
@@ -2708,7 +2691,7 @@ static void do_st_2(CPUArchState *env, MMULookupPageData *p, uint16_t val,
         if (memop & MO_BSWAP) {
             val = bswap16(val);
         }
-        store_memop(p->haddr, val, MO_UW);
+        store_atom_2(env, ra, p->haddr, memop, val);
     }
 }
 
@@ -2724,7 +2707,7 @@ static void do_st_4(CPUArchState *env, MMULookupPageData *p, uint32_t val,
         if (memop & MO_BSWAP) {
             val = bswap32(val);
         }
-        store_memop(p->haddr, val, MO_UL);
+        store_atom_4(env, ra, p->haddr, memop, val);
     }
 }
 
@@ -2740,7 +2723,7 @@ static void do_st_8(CPUArchState *env, MMULookupPageData *p, uint64_t val,
         if (memop & MO_BSWAP) {
             val = bswap64(val);
         }
-        store_memop(p->haddr, val, MO_UQ);
+        store_atom_8(env, ra, p->haddr, memop, val);
     }
 }
 
@@ -2809,8 +2792,8 @@ static void do_st4_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
     if ((l.memop & MO_BSWAP) != MO_LE) {
         val = bswap32(val);
     }
-    val = do_st_leN(env, &l.page[0], val, l.mmu_idx, ra);
-    (void) do_st_leN(env, &l.page[1], val, l.mmu_idx, ra);
+    val = do_st_leN(env, &l.page[0], val, l.mmu_idx, l.memop, ra);
+    (void) do_st_leN(env, &l.page[1], val, l.mmu_idx, l.memop, ra);
 }
 
 void helper_le_stl_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
@@ -2843,8 +2826,8 @@ static void do_st8_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
     if ((l.memop & MO_BSWAP) != MO_LE) {
         val = bswap64(val);
     }
-    val = do_st_leN(env, &l.page[0], val, l.mmu_idx, ra);
-    (void) do_st_leN(env, &l.page[1], val, l.mmu_idx, ra);
+    val = do_st_leN(env, &l.page[0], val, l.mmu_idx, l.memop, ra);
+    (void) do_st_leN(env, &l.page[1], val, l.mmu_idx, l.memop, ra);
 }
 
 void helper_le_stq_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
diff --git a/accel/tcg/user-exec.c b/accel/tcg/user-exec.c
index fefc83cc8c..b89fa35a83 100644
--- a/accel/tcg/user-exec.c
+++ b/accel/tcg/user-exec.c
@@ -1086,7 +1086,7 @@ void cpu_stw_be_mmu(CPUArchState *env, abi_ptr addr, uint16_t val,
 
     validate_memop(oi, MO_BEUW);
     haddr = cpu_mmu_lookup(env, addr, oi, ra, MMU_DATA_STORE);
-    stw_be_p(haddr, val);
+    store_atom_2(env, ra, haddr, get_memop(oi), be16_to_cpu(val));
     clear_helper_retaddr();
     qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_W);
 }
@@ -1098,7 +1098,7 @@ void cpu_stl_be_mmu(CPUArchState *env, abi_ptr addr, uint32_t val,
 
     validate_memop(oi, MO_BEUL);
     haddr = cpu_mmu_lookup(env, addr, oi, ra, MMU_DATA_STORE);
-    stl_be_p(haddr, val);
+    store_atom_4(env, ra, haddr, get_memop(oi), be32_to_cpu(val));
     clear_helper_retaddr();
     qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_W);
 }
@@ -1110,7 +1110,7 @@ void cpu_stq_be_mmu(CPUArchState *env, abi_ptr addr, uint64_t val,
 
     validate_memop(oi, MO_BEUQ);
     haddr = cpu_mmu_lookup(env, addr, oi, ra, MMU_DATA_STORE);
-    stq_be_p(haddr, val);
+    store_atom_8(env, ra, haddr, get_memop(oi), be64_to_cpu(val));
     clear_helper_retaddr();
     qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_W);
 }
@@ -1122,7 +1122,7 @@ void cpu_stw_le_mmu(CPUArchState *env, abi_ptr addr, uint16_t val,
 
     validate_memop(oi, MO_LEUW);
     haddr = cpu_mmu_lookup(env, addr, oi, ra, MMU_DATA_STORE);
-    stw_le_p(haddr, val);
+    store_atom_2(env, ra, haddr, get_memop(oi), le16_to_cpu(val));
     clear_helper_retaddr();
     qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_W);
 }
@@ -1134,7 +1134,7 @@ void cpu_stl_le_mmu(CPUArchState *env, abi_ptr addr, uint32_t val,
 
     validate_memop(oi, MO_LEUL);
     haddr = cpu_mmu_lookup(env, addr, oi, ra, MMU_DATA_STORE);
-    stl_le_p(haddr, val);
+    store_atom_4(env, ra, haddr, get_memop(oi), le32_to_cpu(val));
     clear_helper_retaddr();
     qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_W);
 }
@@ -1146,7 +1146,7 @@ void cpu_stq_le_mmu(CPUArchState *env, abi_ptr addr, uint64_t val,
 
     validate_memop(oi, MO_LEUQ);
     haddr = cpu_mmu_lookup(env, addr, oi, ra, MMU_DATA_STORE);
-    stq_le_p(haddr, val);
+    store_atom_8(env, ra, haddr, get_memop(oi), le64_to_cpu(val));
     clear_helper_retaddr();
     qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_W);
 }
diff --git a/accel/tcg/ldst_atomicity.c.inc b/accel/tcg/ldst_atomicity.c.inc
index 5169073431..07abbdee3f 100644
--- a/accel/tcg/ldst_atomicity.c.inc
+++ b/accel/tcg/ldst_atomicity.c.inc
@@ -21,6 +21,12 @@
 #else
 # define HAVE_al16_fast    false
 #endif
+#if defined(CONFIG_ATOMIC128) || defined(CONFIG_CMPXCHG128)
+# define HAVE_al16         true
+#else
+# define HAVE_al16         false
+#endif
+
 
 /**
  * required_atomicity:
@@ -548,3 +554,488 @@ static uint64_t load_atom_8(CPUArchState *env, uintptr_t ra,
         g_assert_not_reached();
     }
 }
+
+/**
+ * store_atomic2:
+ * @pv: host address
+ * @val: value to store
+ *
+ * Atomically store 2 aligned bytes to @pv.
+ */
+static inline void store_atomic2(void *pv, uint16_t val)
+{
+    uint16_t *p = __builtin_assume_aligned(pv, 2);
+    qatomic_set(p, val);
+}
+
+/**
+ * store_atomic4:
+ * @pv: host address
+ * @val: value to store
+ *
+ * Atomically store 4 aligned bytes to @pv.
+ */
+static inline void store_atomic4(void *pv, uint32_t val)
+{
+    uint32_t *p = __builtin_assume_aligned(pv, 4);
+    qatomic_set(p, val);
+}
+
+/**
+ * store_atomic8:
+ * @pv: host address
+ * @val: value to store
+ *
+ * Atomically store 8 aligned bytes to @pv.
+ */
+static inline void store_atomic8(void *pv, uint64_t val)
+{
+    uint64_t *p = __builtin_assume_aligned(pv, 8);
+
+    qemu_build_assert(HAVE_al8);
+    qatomic_set__nocheck(p, val);
+}
+
+/**
+ * store_atom_4_by_2
+ */
+static inline void store_atom_4_by_2(void *pv, uint32_t val)
+{
+    store_atomic2(pv, val >> (HOST_BIG_ENDIAN ? 16 : 0));
+    store_atomic2(pv + 2, val >> (HOST_BIG_ENDIAN ? 0 : 16));
+}
+
+/**
+ * store_atom_8_by_2
+ */
+static inline void store_atom_8_by_2(void *pv, uint64_t val)
+{
+    store_atom_4_by_2(pv, val >> (HOST_BIG_ENDIAN ? 32 : 0));
+    store_atom_4_by_2(pv + 4, val >> (HOST_BIG_ENDIAN ? 0 : 32));
+}
+
+/**
+ * store_atom_8_by_4
+ */
+static inline void store_atom_8_by_4(void *pv, uint64_t val)
+{
+    store_atomic4(pv, val >> (HOST_BIG_ENDIAN ? 32 : 0));
+    store_atomic4(pv + 4, val >> (HOST_BIG_ENDIAN ? 0 : 32));
+}
+
+/**
+ * store_atom_insert_al4:
+ * @p: host address
+ * @val: shifted value to store
+ * @msk: mask for value to store
+ *
+ * Atomically store @val to @p, masked by @msk.
+ */
+static void store_atom_insert_al4(uint32_t *p, uint32_t val, uint32_t msk)
+{
+    uint32_t old, new;
+
+    p = __builtin_assume_aligned(p, 4);
+    old = qatomic_read(p);
+    do {
+        new = (old & ~msk) | val;
+    } while (!__atomic_compare_exchange_n(p, &old, new, true,
+                                          __ATOMIC_RELAXED, __ATOMIC_RELAXED));
+}
+
+/**
+ * store_atom_insert_al8:
+ * @p: host address
+ * @val: shifted value to store
+ * @msk: mask for value to store
+ *
+ * Atomically store @val to @p masked by @msk.
+ */
+static void store_atom_insert_al8(uint64_t *p, uint64_t val, uint64_t msk)
+{
+    uint64_t old, new;
+
+    qemu_build_assert(HAVE_al8);
+    p = __builtin_assume_aligned(p, 8);
+    old = qatomic_read__nocheck(p);
+    do {
+        new = (old & ~msk) | val;
+    } while (!__atomic_compare_exchange_n(p, &old, new, true,
+                                          __ATOMIC_RELAXED, __ATOMIC_RELAXED));
+}
+
+/**
+ * store_atom_insert_al16:
+ * @ps: host address
+ * @val: shifted value to store
+ * @msk: mask for value to store
+ *
+ * Atomically store @val to @p masked by @msk.
+ */
+static void store_atom_insert_al16(Int128 *ps, Int128Alias val, Int128Alias msk)
+{
+#if defined(CONFIG_ATOMIC128)
+    __uint128_t *pu, old, new;
+
+    /* With CONFIG_ATOMIC128, we can avoid the memory barriers. */
+    pu = __builtin_assume_aligned(ps, 16);
+    old = *pu;
+    do {
+        new = (old & ~msk.u) | val.u;
+    } while (!__atomic_compare_exchange_n(pu, &old, new, true,
+                                          __ATOMIC_RELAXED, __ATOMIC_RELAXED));
+#elif defined(CONFIG_CMPXCHG128)
+    __uint128_t *pu, old, new;
+
+    /*
+     * Without CONFIG_ATOMIC128, __atomic_compare_exchange_n will always
+ * defer to libatomic, so we must use __sync_bool_compare_and_swap_16
+     * and accept the sequential consistency that comes with it.
+     */
+    pu = __builtin_assume_aligned(ps, 16);
+    do {
+        old = *pu;
+        new = (old & ~msk.u) | val.u;
+    } while (!__sync_bool_compare_and_swap_16(pu, old, new));
+#else
+    qemu_build_not_reached();
+#endif
+}
+
+/**
+ * store_bytes_leN:
+ * @pv: host address
+ * @size: number of bytes to store
+ * @val_le: data to store
+ *
+ * Store @size bytes at @pv, taken in little-endian order from @val_le;
+ * return the bytes of @val_le beyond @size that have not been stored.
+ */
+static uint64_t store_bytes_leN(void *pv, int size, uint64_t val_le)
+{
+    uint8_t *p = pv;
+    for (int i = 0; i < size; i++, val_le >>= 8) {
+        p[i] = val_le;
+    }
+    return val_le;
+}
+
+/**
+ * store_parts_leN
+ * @pv: host address
+ * @size: number of bytes to store
+ * @val_le: data to store
+ *
+ * As store_bytes_leN, but atomically on each aligned part.
+ */
+G_GNUC_UNUSED
+static uint64_t store_parts_leN(void *pv, int size, uint64_t val_le)
+{
+    do {
+        int n;
+
+        /* Find minimum of alignment and size */
+        switch (((uintptr_t)pv | size) & 7) {
+        case 4:
+            store_atomic4(pv, le32_to_cpu(val_le));
+            val_le >>= 32;
+            n = 4;
+            break;
+        case 2:
+        case 6:
+            store_atomic2(pv, le16_to_cpu(val_le));
+            val_le >>= 16;
+            n = 2;
+            break;
+        default:
+            *(uint8_t *)pv = val_le;
+            val_le >>= 8;
+            n = 1;
+            break;
+        case 0:
+            g_assert_not_reached();
+        }
+        pv += n;
+        size -= n;
+    } while (size != 0);
+
+    return val_le;
+}
+
+/**
+ * store_whole_le4
+ * @pv: host address
+ * @size: number of bytes to store
+ * @val_le: data to store
+ *
+ * As store_bytes_leN, but atomically as a whole.
+ * Four aligned bytes are guaranteed to cover the store.
+ */
+static uint64_t store_whole_le4(void *pv, int size, uint64_t val_le)
+{
+    int sz = size * 8;
+    int o = (uintptr_t)pv & 3;
+    int sh = o * 8;
+    uint32_t m = MAKE_64BIT_MASK(0, sz);
+    uint32_t v;
+
+    if (HOST_BIG_ENDIAN) {
+        v = bswap32(val_le) >> sh;
+        m = bswap32(m) >> sh;
+    } else {
+        v = val_le << sh;
+        m <<= sh;
+    }
+    store_atom_insert_al4(pv - o, v, m);
+    return val_le >> sz;
+}
+
+/**
+ * store_whole_le8
+ * @pv: host address
+ * @size: number of bytes to store
+ * @val_le: data to store
+ *
+ * As store_bytes_leN, but atomically as a whole.
+ * Eight aligned bytes are guaranteed to cover the store.
+ */
+static uint64_t store_whole_le8(void *pv, int size, uint64_t val_le)
+{
+    int sz = size * 8;
+    int o = (uintptr_t)pv & 7;
+    int sh = o * 8;
+    uint64_t m = MAKE_64BIT_MASK(0, sz);
+    uint64_t v;
+
+    qemu_build_assert(HAVE_al8);
+    if (HOST_BIG_ENDIAN) {
+        v = bswap64(val_le) >> sh;
+        m = bswap64(m) >> sh;
+    } else {
+        v = val_le << sh;
+        m <<= sh;
+    }
+    store_atom_insert_al8(pv - o, v, m);
+    return val_le >> sz;
+}
+
+/**
+ * store_whole_le16
+ * @pv: host address
+ * @size: number of bytes to store
+ * @val_le: data to store
+ *
+ * As store_bytes_leN, but atomically as a whole.
+ * 16 aligned bytes are guaranteed to cover the store.
+ */
+static uint64_t store_whole_le16(void *pv, int size, Int128 val_le)
+{
+    int sz = size * 8;
+    int o = (uintptr_t)pv & 15;
+    int sh = o * 8;
+    Int128 m, v;
+
+    qemu_build_assert(HAVE_al16);
+
+    /* Like MAKE_64BIT_MASK(0, sz), but larger. */
+    if (sz <= 64) {
+        m = int128_make64(MAKE_64BIT_MASK(0, sz));
+    } else {
+        m = int128_make128(-1, MAKE_64BIT_MASK(0, sz - 64));
+    }
+
+    if (HOST_BIG_ENDIAN) {
+        v = int128_urshift(bswap128(val_le), sh);
+        m = int128_urshift(bswap128(m), sh);
+    } else {
+        v = int128_lshift(val_le, sh);
+        m = int128_lshift(m, sh);
+    }
+    store_atom_insert_al16(pv - o, v, m);
+
+    /* Unused if sz <= 64. */
+    return int128_gethi(val_le) >> (sz - 64);
+}
+
+/**
+ * store_atom_2:
+ * @p: host address
+ * @val: the value to store
+ * @memop: the full memory op
+ *
+ * Store 2 bytes to @p, honoring the atomicity of @memop.
+ */
+static void store_atom_2(CPUArchState *env, uintptr_t ra,
+                         void *pv, MemOp memop, uint16_t val)
+{
+    uintptr_t pi = (uintptr_t)pv;
+    int atmax;
+
+    if (likely((pi & 1) == 0)) {
+        store_atomic2(pv, val);
+        return;
+    }
+
+    atmax = required_atomicity(env, pi, memop);
+    if (atmax == MO_8) {
+        stw_he_p(pv, val);
+        return;
+    }
+
+    /*
+     * The only case remaining is MO_ATOM_WITHIN16.
+     * Big or little endian, we want the middle two bytes in each test.
+     */
+    if ((pi & 3) == 1) {
+        store_atom_insert_al4(pv - 1, (uint32_t)val << 8, MAKE_64BIT_MASK(8, 16));
+        return;
+    } else if ((pi & 7) == 3) {
+        if (HAVE_al8) {
+            store_atom_insert_al8(pv - 3, (uint64_t)val << 24, MAKE_64BIT_MASK(24, 16));
+            return;
+        }
+    } else if ((pi & 15) == 7) {
+        if (HAVE_al16) {
+            Int128 v = int128_lshift(int128_make64(val), 56);
+            Int128 m = int128_lshift(int128_make64(0xffff), 56);
+            store_atom_insert_al16(pv - 7, v, m);
+            return;
+        }
+    } else {
+        g_assert_not_reached();
+    }
+
+    cpu_loop_exit_atomic(env_cpu(env), ra);
+}
+
+/**
+ * store_atom_4:
+ * @p: host address
+ * @val: the value to store
+ * @memop: the full memory op
+ *
+ * Store 4 bytes to @p, honoring the atomicity of @memop.
+ */
+static void store_atom_4(CPUArchState *env, uintptr_t ra,
+                         void *pv, MemOp memop, uint32_t val)
+{
+    uintptr_t pi = (uintptr_t)pv;
+    int atmax;
+
+    if (likely((pi & 3) == 0)) {
+        store_atomic4(pv, val);
+        return;
+    }
+
+    atmax = required_atomicity(env, pi, memop);
+    switch (atmax) {
+    case MO_8:
+        stl_he_p(pv, val);
+        return;
+    case MO_16:
+        store_atom_4_by_2(pv, val);
+        return;
+    case -MO_16:
+        {
+            uint32_t val_le = cpu_to_le32(val);
+            int s2 = pi & 3;
+            int s1 = 4 - s2;
+
+            switch (s2) {
+            case 1:
+                val_le = store_whole_le4(pv, s1, val_le);
+                *(uint8_t *)(pv + 3) = val_le;
+                break;
+            case 3:
+                *(uint8_t *)pv = val_le;
+                store_whole_le4(pv + 1, s2, val_le >> 8);
+                break;
+            case 0: /* aligned */
+            case 2: /* atmax MO_16 */
+            default:
+                g_assert_not_reached();
+            }
+        }
+        return;
+    case MO_32:
+        if ((pi & 7) < 4) {
+            if (HAVE_al8) {
+                store_whole_le8(pv, 4, cpu_to_le32(val));
+                return;
+            }
+        } else {
+            if (HAVE_al16) {
+                store_whole_le16(pv, 4, int128_make64(cpu_to_le32(val)));
+                return;
+            }
+        }
+        cpu_loop_exit_atomic(env_cpu(env), ra);
+    default:
+        g_assert_not_reached();
+    }
+}
+
+/**
+ * store_atom_8:
+ * @p: host address
+ * @val: the value to store
+ * @memop: the full memory op
+ *
+ * Store 8 bytes to @p, honoring the atomicity of @memop.
+ */
+static void store_atom_8(CPUArchState *env, uintptr_t ra,
+                         void *pv, MemOp memop, uint64_t val)
+{
+    uintptr_t pi = (uintptr_t)pv;
+    int atmax;
+
+    if (HAVE_al8 && likely((pi & 7) == 0)) {
+        store_atomic8(pv, val);
+        return;
+    }
+
+    atmax = required_atomicity(env, pi, memop);
+    switch (atmax) {
+    case MO_8:
+        stq_he_p(pv, val);
+        return;
+    case MO_16:
+        store_atom_8_by_2(pv, val);
+        return;
+    case MO_32:
+        store_atom_8_by_4(pv, val);
+        return;
+    case -MO_32:
+        if (HAVE_al8) {
+            uint64_t val_le = cpu_to_le64(val);
+            int s2 = pi & 7;
+            int s1 = 8 - s2;
+
+            switch (s2) {
+            case 1 ... 3:
+                val_le = store_whole_le8(pv, s1, val_le);
+                store_bytes_leN(pv + s1, s2, val_le);
+                break;
+            case 5 ... 7:
+                val_le = store_bytes_leN(pv, s1, val_le);
+                store_whole_le8(pv + s1, s2, val_le);
+                break;
+            case 0: /* aligned */
+            case 4: /* atmax MO_32 */
+            default:
+                g_assert_not_reached();
+            }
+            return;
+        }
+        break;
+    case MO_64:
+        if (HAVE_al16) {
+            store_whole_le16(pv, 8, int128_make64(cpu_to_le64(val)));
+            return;
+        }
+        break;
+    default:
+        g_assert_not_reached();
+    }
+    cpu_loop_exit_atomic(env_cpu(env), ra);
+}
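
The store_whole_le* routines above reduce a misaligned store to a single
read-modify-write of the aligned word that contains it: the value and a byte
mask are shifted into position and inserted with a compare-and-swap loop, so
other threads only ever observe the old or the new bytes.  Below is a minimal
standalone sketch of that technique; it is not part of the patch, assumes a
little-endian host and the GCC/Clang __atomic builtins, and the names
insert_al4/main are invented for the example:

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    /* CAS loop inserting @val into *@p under @msk, as store_atom_insert_al4 does. */
    static void insert_al4(uint32_t *p, uint32_t val, uint32_t msk)
    {
        uint32_t old = __atomic_load_n(p, __ATOMIC_RELAXED), new;
        do {
            new = (old & ~msk) | val;
        } while (!__atomic_compare_exchange_n(p, &old, new, 1,
                                              __ATOMIC_RELAXED, __ATOMIC_RELAXED));
    }

    int main(void)
    {
        uint32_t word = 0x44332211;       /* bytes 11 22 33 44 in memory (LE) */
        uint16_t val = 0xbbaa;            /* store these 2 bytes at offset 1 */
        int o = 1, size = 2;
        int sh = o * 8;
        uint32_t m = ((1u << (size * 8)) - 1) << sh;   /* 0x00ffff00 */
        uint32_t v = (uint32_t)val << sh;              /* 0x00bbaa00 */

        insert_al4(&word, v, m);
        printf("%08" PRIx32 "\n", word);  /* 44bbaa11: only bytes 1..2 changed */
        return 0;
    }

This no-tearing property is what the unaligned paths of store_atom_2/4/8
above rely on when they fall back to store_whole_le4/le8/le16.
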
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH v4 08/57] target/loongarch: Do not include tcg-ldst.h
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
                   ` (6 preceding siblings ...)
  2023-05-03  7:06 ` [PATCH v4 07/57] accel/tcg: Honor atomicity of stores Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-05  9:29   ` Peter Maydell
  2023-05-03  7:06 ` [PATCH v4 09/57] tcg: Unify helper_{be,le}_{ld,st}* Richard Henderson
                   ` (49 subsequent siblings)
  57 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel; +Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

This header is supposed to be private to tcg and in fact
does not need to be included here at all.

Reviewed-by: Song Gao <gaosong@loongson.cn>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 target/loongarch/csr_helper.c   | 1 -
 target/loongarch/iocsr_helper.c | 1 -
 2 files changed, 2 deletions(-)

diff --git a/target/loongarch/csr_helper.c b/target/loongarch/csr_helper.c
index 7e02787895..6526367946 100644
--- a/target/loongarch/csr_helper.c
+++ b/target/loongarch/csr_helper.c
@@ -15,7 +15,6 @@
 #include "exec/cpu_ldst.h"
 #include "hw/irq.h"
 #include "cpu-csr.h"
-#include "tcg/tcg-ldst.h"
 
 target_ulong helper_csrrd_pgd(CPULoongArchState *env)
 {
diff --git a/target/loongarch/iocsr_helper.c b/target/loongarch/iocsr_helper.c
index 505853e17b..dda9845d6c 100644
--- a/target/loongarch/iocsr_helper.c
+++ b/target/loongarch/iocsr_helper.c
@@ -12,7 +12,6 @@
 #include "exec/helper-proto.h"
 #include "exec/exec-all.h"
 #include "exec/cpu_ldst.h"
-#include "tcg/tcg-ldst.h"
 
 #define GET_MEMTXATTRS(cas) \
         ((MemTxAttrs){.requester_id = env_cpu(cas)->cpu_index})
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH v4 09/57] tcg: Unify helper_{be,le}_{ld,st}*
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
                   ` (7 preceding siblings ...)
  2023-05-03  7:06 ` [PATCH v4 08/57] target/loongarch: Do not include tcg-ldst.h Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-05  9:36   ` Peter Maydell
  2023-05-03  7:06 ` [PATCH v4 10/57] accel/tcg: Implement helper_{ld, st}*_mmu for user-only Richard Henderson
                   ` (48 subsequent siblings)
  57 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel; +Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

With the current structure of cputlb.c, there is no difference
between the little-endian and big-endian entry points, aside
from the assert.  Unify the pairs of functions.

Hoist the qemu_{ld,st}_helpers arrays to tcg.c.
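
A rough standalone sketch of the idea, using made-up MemOp-like values rather
than QEMU's real encoding: once the operation descriptor carries the byte-swap
flag, a single entry point per access size suffices, and the helper table only
needs to be indexed by size.

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Made-up MemOp-like bits: size in the low bits, bswap as a flag. */
    enum { MO_32 = 2, MO_SIZE = 3, MO_BSWAP = 8 };

    /* One 32-bit load helper; endianness is decided by the descriptor. */
    static uint32_t ldl_mmu(const void *haddr, unsigned mop)
    {
        uint32_t v;
        memcpy(&v, haddr, 4);                      /* host-endian load */
        return (mop & MO_BSWAP) ? __builtin_bswap32(v) : v;
    }

    int main(void)
    {
        uint8_t buf[4] = { 0x12, 0x34, 0x56, 0x78 };
        /* The same entry point serves both byte orders, so a dispatch table
         * only needs to be indexed by (mop & MO_SIZE). */
        printf("%08" PRIx32 " %08" PRIx32 "\n",
               ldl_mmu(buf, MO_32), ldl_mmu(buf, MO_32 | MO_BSWAP));
        return 0;
    }

On a little-endian host this prints 78563412 12345678 from one function, which
is the reason the per-backend qemu_{ld,st}_helpers tables can collapse into a
single MO_SIZE-indexed array hoisted to tcg.c.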

Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 include/tcg/tcg-ldst.h           |  60 ++++------
 accel/tcg/cputlb.c               | 190 ++++++++++---------------------
 tcg/tcg.c                        |  21 ++++
 tcg/tci.c                        |  61 ++++------
 docs/devel/loads-stores.rst      |  36 ++----
 tcg/aarch64/tcg-target.c.inc     |  33 ------
 tcg/arm/tcg-target.c.inc         |  37 ------
 tcg/i386/tcg-target.c.inc        |  30 +----
 tcg/loongarch64/tcg-target.c.inc |  23 ----
 tcg/mips/tcg-target.c.inc        |  31 -----
 tcg/ppc/tcg-target.c.inc         |  30 +----
 tcg/riscv/tcg-target.c.inc       |  42 -------
 tcg/s390x/tcg-target.c.inc       |  31 +----
 tcg/sparc64/tcg-target.c.inc     |  32 +-----
 14 files changed, 146 insertions(+), 511 deletions(-)

diff --git a/include/tcg/tcg-ldst.h b/include/tcg/tcg-ldst.h
index 684e394b06..3d897ca942 100644
--- a/include/tcg/tcg-ldst.h
+++ b/include/tcg/tcg-ldst.h
@@ -28,51 +28,35 @@
 #ifdef CONFIG_SOFTMMU
 
 /* Value zero-extended to tcg register size.  */
-tcg_target_ulong helper_ret_ldub_mmu(CPUArchState *env, target_ulong addr,
-                                     MemOpIdx oi, uintptr_t retaddr);
-tcg_target_ulong helper_le_lduw_mmu(CPUArchState *env, target_ulong addr,
-                                    MemOpIdx oi, uintptr_t retaddr);
-tcg_target_ulong helper_le_ldul_mmu(CPUArchState *env, target_ulong addr,
-                                    MemOpIdx oi, uintptr_t retaddr);
-uint64_t helper_le_ldq_mmu(CPUArchState *env, target_ulong addr,
-                           MemOpIdx oi, uintptr_t retaddr);
-tcg_target_ulong helper_be_lduw_mmu(CPUArchState *env, target_ulong addr,
-                                    MemOpIdx oi, uintptr_t retaddr);
-tcg_target_ulong helper_be_ldul_mmu(CPUArchState *env, target_ulong addr,
-                                    MemOpIdx oi, uintptr_t retaddr);
-uint64_t helper_be_ldq_mmu(CPUArchState *env, target_ulong addr,
-                           MemOpIdx oi, uintptr_t retaddr);
+tcg_target_ulong helper_ldub_mmu(CPUArchState *env, target_ulong addr,
+                                 MemOpIdx oi, uintptr_t retaddr);
+tcg_target_ulong helper_lduw_mmu(CPUArchState *env, target_ulong addr,
+                                 MemOpIdx oi, uintptr_t retaddr);
+tcg_target_ulong helper_ldul_mmu(CPUArchState *env, target_ulong addr,
+                                 MemOpIdx oi, uintptr_t retaddr);
+uint64_t helper_ldq_mmu(CPUArchState *env, target_ulong addr,
+                        MemOpIdx oi, uintptr_t retaddr);
 
 /* Value sign-extended to tcg register size.  */
-tcg_target_ulong helper_ret_ldsb_mmu(CPUArchState *env, target_ulong addr,
-                                     MemOpIdx oi, uintptr_t retaddr);
-tcg_target_ulong helper_le_ldsw_mmu(CPUArchState *env, target_ulong addr,
-                                    MemOpIdx oi, uintptr_t retaddr);
-tcg_target_ulong helper_le_ldsl_mmu(CPUArchState *env, target_ulong addr,
-                                    MemOpIdx oi, uintptr_t retaddr);
-tcg_target_ulong helper_be_ldsw_mmu(CPUArchState *env, target_ulong addr,
-                                    MemOpIdx oi, uintptr_t retaddr);
-tcg_target_ulong helper_be_ldsl_mmu(CPUArchState *env, target_ulong addr,
-                                    MemOpIdx oi, uintptr_t retaddr);
+tcg_target_ulong helper_ldsb_mmu(CPUArchState *env, target_ulong addr,
+                                 MemOpIdx oi, uintptr_t retaddr);
+tcg_target_ulong helper_ldsw_mmu(CPUArchState *env, target_ulong addr,
+                                 MemOpIdx oi, uintptr_t retaddr);
+tcg_target_ulong helper_ldsl_mmu(CPUArchState *env, target_ulong addr,
+                                 MemOpIdx oi, uintptr_t retaddr);
 
 /*
  * Value extended to at least uint32_t, so that some ABIs do not require
  * zero-extension from uint8_t or uint16_t.
  */
-void helper_ret_stb_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
-                        MemOpIdx oi, uintptr_t retaddr);
-void helper_le_stw_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
-                       MemOpIdx oi, uintptr_t retaddr);
-void helper_le_stl_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
-                       MemOpIdx oi, uintptr_t retaddr);
-void helper_le_stq_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
-                       MemOpIdx oi, uintptr_t retaddr);
-void helper_be_stw_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
-                       MemOpIdx oi, uintptr_t retaddr);
-void helper_be_stl_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
-                       MemOpIdx oi, uintptr_t retaddr);
-void helper_be_stq_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
-                       MemOpIdx oi, uintptr_t retaddr);
+void helper_stb_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
+                    MemOpIdx oi, uintptr_t retaddr);
+void helper_stw_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
+                    MemOpIdx oi, uintptr_t retaddr);
+void helper_stl_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
+                    MemOpIdx oi, uintptr_t retaddr);
+void helper_stq_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
+                    MemOpIdx oi, uintptr_t retaddr);
 
 #else
 
diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index 6b8b472a11..566cf8311b 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -2011,25 +2011,6 @@ static void *atomic_mmu_lookup(CPUArchState *env, target_ulong addr,
     cpu_loop_exit_atomic(env_cpu(env), retaddr);
 }
 
-/*
- * Verify that we have passed the correct MemOp to the correct function.
- *
- * In the case of the helper_*_mmu functions, we will have done this by
- * using the MemOp to look up the helper during code generation.
- *
- * In the case of the cpu_*_mmu functions, this is up to the caller.
- * We could present one function to target code, and dispatch based on
- * the MemOp, but so far we have worked hard to avoid an indirect function
- * call along the memory path.
- */
-static void validate_memop(MemOpIdx oi, MemOp expected)
-{
-#ifdef CONFIG_DEBUG_TCG
-    MemOp have = get_memop(oi) & (MO_SIZE | MO_BSWAP);
-    assert(have == expected);
-#endif
-}
-
 /*
  * Load Helpers
  *
@@ -2297,10 +2278,10 @@ static uint8_t do_ld1_mmu(CPUArchState *env, target_ulong addr, MemOpIdx oi,
     return do_ld_1(env, &l.page[0], l.mmu_idx, access_type, ra);
 }
 
-tcg_target_ulong helper_ret_ldub_mmu(CPUArchState *env, target_ulong addr,
-                                     MemOpIdx oi, uintptr_t retaddr)
+tcg_target_ulong helper_ldub_mmu(CPUArchState *env, target_ulong addr,
+                                 MemOpIdx oi, uintptr_t retaddr)
 {
-    validate_memop(oi, MO_UB);
+    tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_8);
     return do_ld1_mmu(env, addr, oi, retaddr, MMU_DATA_LOAD);
 }
 
@@ -2328,17 +2309,10 @@ static uint16_t do_ld2_mmu(CPUArchState *env, target_ulong addr, MemOpIdx oi,
     return ret;
 }
 
-tcg_target_ulong helper_le_lduw_mmu(CPUArchState *env, target_ulong addr,
-                                    MemOpIdx oi, uintptr_t retaddr)
+tcg_target_ulong helper_lduw_mmu(CPUArchState *env, target_ulong addr,
+                                 MemOpIdx oi, uintptr_t retaddr)
 {
-    validate_memop(oi, MO_LEUW);
-    return do_ld2_mmu(env, addr, oi, retaddr, MMU_DATA_LOAD);
-}
-
-tcg_target_ulong helper_be_lduw_mmu(CPUArchState *env, target_ulong addr,
-                                    MemOpIdx oi, uintptr_t retaddr)
-{
-    validate_memop(oi, MO_BEUW);
+    tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_16);
     return do_ld2_mmu(env, addr, oi, retaddr, MMU_DATA_LOAD);
 }
 
@@ -2362,17 +2336,10 @@ static uint32_t do_ld4_mmu(CPUArchState *env, target_ulong addr, MemOpIdx oi,
     return ret;
 }
 
-tcg_target_ulong helper_le_ldul_mmu(CPUArchState *env, target_ulong addr,
-                                    MemOpIdx oi, uintptr_t retaddr)
+tcg_target_ulong helper_ldul_mmu(CPUArchState *env, target_ulong addr,
+                                 MemOpIdx oi, uintptr_t retaddr)
 {
-    validate_memop(oi, MO_LEUL);
-    return do_ld4_mmu(env, addr, oi, retaddr, MMU_DATA_LOAD);
-}
-
-tcg_target_ulong helper_be_ldul_mmu(CPUArchState *env, target_ulong addr,
-                                    MemOpIdx oi, uintptr_t retaddr)
-{
-    validate_memop(oi, MO_BEUL);
+    tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_32);
     return do_ld4_mmu(env, addr, oi, retaddr, MMU_DATA_LOAD);
 }
 
@@ -2396,17 +2363,10 @@ static uint64_t do_ld8_mmu(CPUArchState *env, target_ulong addr, MemOpIdx oi,
     return ret;
 }
 
-uint64_t helper_le_ldq_mmu(CPUArchState *env, target_ulong addr,
-                           MemOpIdx oi, uintptr_t retaddr)
+uint64_t helper_ldq_mmu(CPUArchState *env, target_ulong addr,
+                        MemOpIdx oi, uintptr_t retaddr)
 {
-    validate_memop(oi, MO_LEUQ);
-    return do_ld8_mmu(env, addr, oi, retaddr, MMU_DATA_LOAD);
-}
-
-uint64_t helper_be_ldq_mmu(CPUArchState *env, target_ulong addr,
-                           MemOpIdx oi, uintptr_t retaddr)
-{
-    validate_memop(oi, MO_BEUQ);
+    tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_64);
     return do_ld8_mmu(env, addr, oi, retaddr, MMU_DATA_LOAD);
 }
 
@@ -2415,35 +2375,22 @@ uint64_t helper_be_ldq_mmu(CPUArchState *env, target_ulong addr,
  * avoid this for 64-bit data, or for 32-bit data on 32-bit host.
  */
 
-
-tcg_target_ulong helper_ret_ldsb_mmu(CPUArchState *env, target_ulong addr,
-                                     MemOpIdx oi, uintptr_t retaddr)
+tcg_target_ulong helper_ldsb_mmu(CPUArchState *env, target_ulong addr,
+                                 MemOpIdx oi, uintptr_t retaddr)
 {
-    return (int8_t)helper_ret_ldub_mmu(env, addr, oi, retaddr);
+    return (int8_t)helper_ldub_mmu(env, addr, oi, retaddr);
 }
 
-tcg_target_ulong helper_le_ldsw_mmu(CPUArchState *env, target_ulong addr,
-                                    MemOpIdx oi, uintptr_t retaddr)
+tcg_target_ulong helper_ldsw_mmu(CPUArchState *env, target_ulong addr,
+                                 MemOpIdx oi, uintptr_t retaddr)
 {
-    return (int16_t)helper_le_lduw_mmu(env, addr, oi, retaddr);
+    return (int16_t)helper_lduw_mmu(env, addr, oi, retaddr);
 }
 
-tcg_target_ulong helper_be_ldsw_mmu(CPUArchState *env, target_ulong addr,
-                                    MemOpIdx oi, uintptr_t retaddr)
+tcg_target_ulong helper_ldsl_mmu(CPUArchState *env, target_ulong addr,
+                                 MemOpIdx oi, uintptr_t retaddr)
 {
-    return (int16_t)helper_be_lduw_mmu(env, addr, oi, retaddr);
-}
-
-tcg_target_ulong helper_le_ldsl_mmu(CPUArchState *env, target_ulong addr,
-                                    MemOpIdx oi, uintptr_t retaddr)
-{
-    return (int32_t)helper_le_ldul_mmu(env, addr, oi, retaddr);
-}
-
-tcg_target_ulong helper_be_ldsl_mmu(CPUArchState *env, target_ulong addr,
-                                    MemOpIdx oi, uintptr_t retaddr)
-{
-    return (int32_t)helper_be_ldul_mmu(env, addr, oi, retaddr);
+    return (int32_t)helper_ldul_mmu(env, addr, oi, retaddr);
 }
 
 /*
@@ -2459,7 +2406,7 @@ uint8_t cpu_ldb_mmu(CPUArchState *env, abi_ptr addr, MemOpIdx oi, uintptr_t ra)
 {
     uint8_t ret;
 
-    validate_memop(oi, MO_UB);
+    tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_UB);
     ret = do_ld1_mmu(env, addr, oi, ra, MMU_DATA_LOAD);
     plugin_load_cb(env, addr, oi);
     return ret;
@@ -2470,7 +2417,7 @@ uint16_t cpu_ldw_be_mmu(CPUArchState *env, abi_ptr addr,
 {
     uint16_t ret;
 
-    validate_memop(oi, MO_BEUW);
+    tcg_debug_assert((get_memop(oi) & (MO_BSWAP | MO_SIZE)) == MO_BEUW);
     ret = do_ld2_mmu(env, addr, oi, ra, MMU_DATA_LOAD);
     plugin_load_cb(env, addr, oi);
     return ret;
@@ -2481,7 +2428,7 @@ uint32_t cpu_ldl_be_mmu(CPUArchState *env, abi_ptr addr,
 {
     uint32_t ret;
 
-    validate_memop(oi, MO_BEUL);
+    tcg_debug_assert((get_memop(oi) & (MO_BSWAP | MO_SIZE)) == MO_BEUL);
     ret = do_ld4_mmu(env, addr, oi, ra, MMU_DATA_LOAD);
     plugin_load_cb(env, addr, oi);
     return ret;
@@ -2492,7 +2439,7 @@ uint64_t cpu_ldq_be_mmu(CPUArchState *env, abi_ptr addr,
 {
     uint64_t ret;
 
-    validate_memop(oi, MO_BEUQ);
+    tcg_debug_assert((get_memop(oi) & (MO_BSWAP | MO_SIZE)) == MO_BEUQ);
     ret = do_ld8_mmu(env, addr, oi, ra, MMU_DATA_LOAD);
     plugin_load_cb(env, addr, oi);
     return ret;
@@ -2503,7 +2450,7 @@ uint16_t cpu_ldw_le_mmu(CPUArchState *env, abi_ptr addr,
 {
     uint16_t ret;
 
-    validate_memop(oi, MO_LEUW);
+    tcg_debug_assert((get_memop(oi) & (MO_BSWAP | MO_SIZE)) == MO_LEUW);
     ret = do_ld2_mmu(env, addr, oi, ra, MMU_DATA_LOAD);
     plugin_load_cb(env, addr, oi);
     return ret;
@@ -2514,7 +2461,7 @@ uint32_t cpu_ldl_le_mmu(CPUArchState *env, abi_ptr addr,
 {
     uint32_t ret;
 
-    validate_memop(oi, MO_LEUL);
+    tcg_debug_assert((get_memop(oi) & (MO_BSWAP | MO_SIZE)) == MO_LEUL);
     ret = do_ld4_mmu(env, addr, oi, ra, MMU_DATA_LOAD);
     plugin_load_cb(env, addr, oi);
     return ret;
@@ -2525,7 +2472,7 @@ uint64_t cpu_ldq_le_mmu(CPUArchState *env, abi_ptr addr,
 {
     uint64_t ret;
 
-    validate_memop(oi, MO_LEUQ);
+    tcg_debug_assert((get_memop(oi) & (MO_BSWAP | MO_SIZE)) == MO_LEUQ);
     ret = do_ld8_mmu(env, addr, oi, ra, MMU_DATA_LOAD);
     plugin_load_cb(env, addr, oi);
     return ret;
@@ -2553,8 +2500,8 @@ Int128 cpu_ld16_be_mmu(CPUArchState *env, abi_ptr addr,
     mop = (mop & ~(MO_SIZE | MO_AMASK)) | MO_64 | MO_UNALN;
     new_oi = make_memop_idx(mop, mmu_idx);
 
-    h = helper_be_ldq_mmu(env, addr, new_oi, ra);
-    l = helper_be_ldq_mmu(env, addr + 8, new_oi, ra);
+    h = helper_ldq_mmu(env, addr, new_oi, ra);
+    l = helper_ldq_mmu(env, addr + 8, new_oi, ra);
 
     qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_R);
     return int128_make128(l, h);
@@ -2582,8 +2529,8 @@ Int128 cpu_ld16_le_mmu(CPUArchState *env, abi_ptr addr,
     mop = (mop & ~(MO_SIZE | MO_AMASK)) | MO_64 | MO_UNALN;
     new_oi = make_memop_idx(mop, mmu_idx);
 
-    l = helper_le_ldq_mmu(env, addr, new_oi, ra);
-    h = helper_le_ldq_mmu(env, addr + 8, new_oi, ra);
+    l = helper_ldq_mmu(env, addr, new_oi, ra);
+    h = helper_ldq_mmu(env, addr + 8, new_oi, ra);
 
     qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_R);
     return int128_make128(l, h);
@@ -2727,13 +2674,13 @@ static void do_st_8(CPUArchState *env, MMULookupPageData *p, uint64_t val,
     }
 }
 
-void helper_ret_stb_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
-                        MemOpIdx oi, uintptr_t ra)
+void helper_stb_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
+                    MemOpIdx oi, uintptr_t ra)
 {
     MMULookupLocals l;
     bool crosspage;
 
-    validate_memop(oi, MO_UB);
+    tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_8);
     crosspage = mmu_lookup(env, addr, oi, ra, MMU_DATA_STORE, &l);
     tcg_debug_assert(!crosspage);
 
@@ -2762,17 +2709,10 @@ static void do_st2_mmu(CPUArchState *env, target_ulong addr, uint16_t val,
     do_st_1(env, &l.page[1], b, l.mmu_idx, ra);
 }
 
-void helper_le_stw_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
-                       MemOpIdx oi, uintptr_t retaddr)
+void helper_stw_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
+                    MemOpIdx oi, uintptr_t retaddr)
 {
-    validate_memop(oi, MO_LEUW);
-    do_st2_mmu(env, addr, val, oi, retaddr);
-}
-
-void helper_be_stw_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
-                       MemOpIdx oi, uintptr_t retaddr)
-{
-    validate_memop(oi, MO_BEUW);
+    tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_16);
     do_st2_mmu(env, addr, val, oi, retaddr);
 }
 
@@ -2796,17 +2736,10 @@ static void do_st4_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
     (void) do_st_leN(env, &l.page[1], val, l.mmu_idx, l.memop, ra);
 }
 
-void helper_le_stl_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
-                       MemOpIdx oi, uintptr_t retaddr)
+void helper_stl_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
+                    MemOpIdx oi, uintptr_t retaddr)
 {
-    validate_memop(oi, MO_LEUL);
-    do_st4_mmu(env, addr, val, oi, retaddr);
-}
-
-void helper_be_stl_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
-                       MemOpIdx oi, uintptr_t retaddr)
-{
-    validate_memop(oi, MO_BEUL);
+    tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_32);
     do_st4_mmu(env, addr, val, oi, retaddr);
 }
 
@@ -2830,17 +2763,10 @@ static void do_st8_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
     (void) do_st_leN(env, &l.page[1], val, l.mmu_idx, l.memop, ra);
 }
 
-void helper_le_stq_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
-                       MemOpIdx oi, uintptr_t retaddr)
+void helper_stq_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
+                    MemOpIdx oi, uintptr_t retaddr)
 {
-    validate_memop(oi, MO_LEUQ);
-    do_st8_mmu(env, addr, val, oi, retaddr);
-}
-
-void helper_be_stq_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
-                       MemOpIdx oi, uintptr_t retaddr)
-{
-    validate_memop(oi, MO_BEUQ);
+    tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_64);
     do_st8_mmu(env, addr, val, oi, retaddr);
 }
 
@@ -2856,49 +2782,55 @@ static void plugin_store_cb(CPUArchState *env, abi_ptr addr, MemOpIdx oi)
 void cpu_stb_mmu(CPUArchState *env, target_ulong addr, uint8_t val,
                  MemOpIdx oi, uintptr_t retaddr)
 {
-    helper_ret_stb_mmu(env, addr, val, oi, retaddr);
+    helper_stb_mmu(env, addr, val, oi, retaddr);
     plugin_store_cb(env, addr, oi);
 }
 
 void cpu_stw_be_mmu(CPUArchState *env, target_ulong addr, uint16_t val,
                     MemOpIdx oi, uintptr_t retaddr)
 {
-    helper_be_stw_mmu(env, addr, val, oi, retaddr);
+    tcg_debug_assert((get_memop(oi) & (MO_BSWAP | MO_SIZE)) == MO_BEUW);
+    do_st2_mmu(env, addr, val, oi, retaddr);
     plugin_store_cb(env, addr, oi);
 }
 
 void cpu_stl_be_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
                     MemOpIdx oi, uintptr_t retaddr)
 {
-    helper_be_stl_mmu(env, addr, val, oi, retaddr);
+    tcg_debug_assert((get_memop(oi) & (MO_BSWAP | MO_SIZE)) == MO_BEUL);
+    do_st4_mmu(env, addr, val, oi, retaddr);
     plugin_store_cb(env, addr, oi);
 }
 
 void cpu_stq_be_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
                     MemOpIdx oi, uintptr_t retaddr)
 {
-    helper_be_stq_mmu(env, addr, val, oi, retaddr);
+    tcg_debug_assert((get_memop(oi) & (MO_BSWAP | MO_SIZE)) == MO_BEUQ);
+    do_st8_mmu(env, addr, val, oi, retaddr);
     plugin_store_cb(env, addr, oi);
 }
 
 void cpu_stw_le_mmu(CPUArchState *env, target_ulong addr, uint16_t val,
                     MemOpIdx oi, uintptr_t retaddr)
 {
-    helper_le_stw_mmu(env, addr, val, oi, retaddr);
+    tcg_debug_assert((get_memop(oi) & (MO_BSWAP | MO_SIZE)) == MO_LEUW);
+    do_st2_mmu(env, addr, val, oi, retaddr);
     plugin_store_cb(env, addr, oi);
 }
 
 void cpu_stl_le_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
                     MemOpIdx oi, uintptr_t retaddr)
 {
-    helper_le_stl_mmu(env, addr, val, oi, retaddr);
+    tcg_debug_assert((get_memop(oi) & (MO_BSWAP | MO_SIZE)) == MO_LEUL);
+    do_st4_mmu(env, addr, val, oi, retaddr);
     plugin_store_cb(env, addr, oi);
 }
 
 void cpu_stq_le_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
                     MemOpIdx oi, uintptr_t retaddr)
 {
-    helper_le_stq_mmu(env, addr, val, oi, retaddr);
+    tcg_debug_assert((get_memop(oi) & (MO_BSWAP | MO_SIZE)) == MO_LEUQ);
+    do_st8_mmu(env, addr, val, oi, retaddr);
     plugin_store_cb(env, addr, oi);
 }
 
@@ -2923,8 +2855,8 @@ void cpu_st16_be_mmu(CPUArchState *env, abi_ptr addr, Int128 val,
     mop = (mop & ~(MO_SIZE | MO_AMASK)) | MO_64 | MO_UNALN;
     new_oi = make_memop_idx(mop, mmu_idx);
 
-    helper_be_stq_mmu(env, addr, int128_gethi(val), new_oi, ra);
-    helper_be_stq_mmu(env, addr + 8, int128_getlo(val), new_oi, ra);
+    helper_stq_mmu(env, addr, int128_gethi(val), new_oi, ra);
+    helper_stq_mmu(env, addr + 8, int128_getlo(val), new_oi, ra);
 
     qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_W);
 }
@@ -2950,8 +2882,8 @@ void cpu_st16_le_mmu(CPUArchState *env, abi_ptr addr, Int128 val,
     mop = (mop & ~(MO_SIZE | MO_AMASK)) | MO_64 | MO_UNALN;
     new_oi = make_memop_idx(mop, mmu_idx);
 
-    helper_le_stq_mmu(env, addr, int128_getlo(val), new_oi, ra);
-    helper_le_stq_mmu(env, addr + 8, int128_gethi(val), new_oi, ra);
+    helper_stq_mmu(env, addr, int128_getlo(val), new_oi, ra);
+    helper_stq_mmu(env, addr + 8, int128_gethi(val), new_oi, ra);
 
     qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_W);
 }
diff --git a/tcg/tcg.c b/tcg/tcg.c
index 748be8426a..12510c78c6 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -197,6 +197,27 @@ static void tcg_out_st_helper_args(TCGContext *s, const TCGLabelQemuLdst *l,
                                    const TCGLdstHelperParam *p)
     __attribute__((unused));
 
+#ifdef CONFIG_SOFTMMU
+static void * const qemu_ld_helpers[MO_SSIZE + 1] = {
+    [MO_UB] = helper_ldub_mmu,
+    [MO_SB] = helper_ldsb_mmu,
+    [MO_UW] = helper_lduw_mmu,
+    [MO_SW] = helper_ldsw_mmu,
+    [MO_UL] = helper_ldul_mmu,
+    [MO_UQ] = helper_ldq_mmu,
+#if TCG_TARGET_REG_BITS == 64
+    [MO_SL] = helper_ldsl_mmu,
+#endif
+};
+
+static void * const qemu_st_helpers[MO_SIZE + 1] = {
+    [MO_8]  = helper_stb_mmu,
+    [MO_16] = helper_stw_mmu,
+    [MO_32] = helper_stl_mmu,
+    [MO_64] = helper_stq_mmu,
+};
+#endif
+
 TCGContext tcg_init_ctx;
 __thread TCGContext *tcg_ctx;
 
diff --git a/tcg/tci.c b/tcg/tci.c
index fc67e7e767..5bde2e1f2e 100644
--- a/tcg/tci.c
+++ b/tcg/tci.c
@@ -293,31 +293,21 @@ static uint64_t tci_qemu_ld(CPUArchState *env, target_ulong taddr,
     uintptr_t ra = (uintptr_t)tb_ptr;
 
 #ifdef CONFIG_SOFTMMU
-    switch (mop & (MO_BSWAP | MO_SSIZE)) {
+    switch (mop & MO_SSIZE) {
     case MO_UB:
-        return helper_ret_ldub_mmu(env, taddr, oi, ra);
+        return helper_ldub_mmu(env, taddr, oi, ra);
     case MO_SB:
-        return helper_ret_ldsb_mmu(env, taddr, oi, ra);
-    case MO_LEUW:
-        return helper_le_lduw_mmu(env, taddr, oi, ra);
-    case MO_LESW:
-        return helper_le_ldsw_mmu(env, taddr, oi, ra);
-    case MO_LEUL:
-        return helper_le_ldul_mmu(env, taddr, oi, ra);
-    case MO_LESL:
-        return helper_le_ldsl_mmu(env, taddr, oi, ra);
-    case MO_LEUQ:
-        return helper_le_ldq_mmu(env, taddr, oi, ra);
-    case MO_BEUW:
-        return helper_be_lduw_mmu(env, taddr, oi, ra);
-    case MO_BESW:
-        return helper_be_ldsw_mmu(env, taddr, oi, ra);
-    case MO_BEUL:
-        return helper_be_ldul_mmu(env, taddr, oi, ra);
-    case MO_BESL:
-        return helper_be_ldsl_mmu(env, taddr, oi, ra);
-    case MO_BEUQ:
-        return helper_be_ldq_mmu(env, taddr, oi, ra);
+        return helper_ldsb_mmu(env, taddr, oi, ra);
+    case MO_UW:
+        return helper_lduw_mmu(env, taddr, oi, ra);
+    case MO_SW:
+        return helper_ldsw_mmu(env, taddr, oi, ra);
+    case MO_UL:
+        return helper_ldul_mmu(env, taddr, oi, ra);
+    case MO_SL:
+        return helper_ldsl_mmu(env, taddr, oi, ra);
+    case MO_UQ:
+        return helper_ldq_mmu(env, taddr, oi, ra);
     default:
         g_assert_not_reached();
     }
@@ -382,27 +372,18 @@ static void tci_qemu_st(CPUArchState *env, target_ulong taddr, uint64_t val,
     uintptr_t ra = (uintptr_t)tb_ptr;
 
 #ifdef CONFIG_SOFTMMU
-    switch (mop & (MO_BSWAP | MO_SIZE)) {
+    switch (mop & MO_SIZE) {
     case MO_UB:
-        helper_ret_stb_mmu(env, taddr, val, oi, ra);
+        helper_stb_mmu(env, taddr, val, oi, ra);
         break;
-    case MO_LEUW:
-        helper_le_stw_mmu(env, taddr, val, oi, ra);
+    case MO_UW:
+        helper_stw_mmu(env, taddr, val, oi, ra);
         break;
-    case MO_LEUL:
-        helper_le_stl_mmu(env, taddr, val, oi, ra);
+    case MO_UL:
+        helper_stl_mmu(env, taddr, val, oi, ra);
         break;
-    case MO_LEUQ:
-        helper_le_stq_mmu(env, taddr, val, oi, ra);
-        break;
-    case MO_BEUW:
-        helper_be_stw_mmu(env, taddr, val, oi, ra);
-        break;
-    case MO_BEUL:
-        helper_be_stl_mmu(env, taddr, val, oi, ra);
-        break;
-    case MO_BEUQ:
-        helper_be_stq_mmu(env, taddr, val, oi, ra);
+    case MO_UQ:
+        helper_stq_mmu(env, taddr, val, oi, ra);
         break;
     default:
         g_assert_not_reached();
diff --git a/docs/devel/loads-stores.rst b/docs/devel/loads-stores.rst
index ad5dfe133e..d2cefc77a2 100644
--- a/docs/devel/loads-stores.rst
+++ b/docs/devel/loads-stores.rst
@@ -297,31 +297,20 @@ swap: ``translator_ld{sign}{size}_swap(env, ptr, swap)``
 Regexes for git grep
  - ``\<translator_ld[us]\?[bwlq]\(_swap\)\?\>``
 
-``helper_*_{ld,st}*_mmu``
+``helper_{ld,st}*_mmu``
 ~~~~~~~~~~~~~~~~~~~~~~~~~
 
 These functions are intended primarily to be called by the code
-generated by the TCG backend. They may also be called by target
-CPU helper function code. Like the ``cpu_{ld,st}_mmuidx_ra`` functions
-they perform accesses by guest virtual address, with a given ``mmuidx``.
+generated by the TCG backend.  Like the ``cpu_{ld,st}_mmu`` functions
+they perform accesses by guest virtual address, with a given ``MemOpIdx``.
 
-These functions specify an ``opindex`` parameter which encodes
-(among other things) the mmu index to use for the access.  This parameter
-should be created by calling ``make_memop_idx()``.
+They differ from ``cpu_{ld,st}_mmu`` in that they take the endianness
+of the operation only from the MemOpIdx, and loads extend the return
+value to the size of a host general register (``tcg_target_ulong``).
 
-The ``retaddr`` parameter should be the result of GETPC() called directly
-from the top level HELPER(foo) function (or 0 if no guest CPU state
-unwinding is required).
+load: ``helper_ld{sign}{size}_mmu(env, addr, opindex, retaddr)``
 
-**TODO** The names of these functions are a bit odd for historical
-reasons because they were originally expected to be called only from
-within generated code. We should rename them to bring them more in
-line with the other memory access functions. The explicit endianness
-is the only feature they have beyond ``*_mmuidx_ra``.
-
-load: ``helper_{endian}_ld{sign}{size}_mmu(env, addr, opindex, retaddr)``
-
-store: ``helper_{endian}_st{size}_mmu(env, addr, val, opindex, retaddr)``
+store: ``helper_st{size}_mmu(env, addr, val, opindex, retaddr)``
 
 ``sign``
  - (empty) : for 32 or 64 bit sizes
@@ -334,14 +323,9 @@ store: ``helper_{endian}_st{size}_mmu(env, addr, val, opindex, retaddr)``
  - ``l`` : 32 bits
  - ``q`` : 64 bits
 
-``endian``
- - ``le`` : little endian
- - ``be`` : big endian
- - ``ret`` : target endianness
-
 Regexes for git grep
- - ``\<helper_\(le\|be\|ret\)_ld[us]\?[bwlq]_mmu\>``
- - ``\<helper_\(le\|be\|ret\)_st[bwlq]_mmu\>``
+ - ``\<helper_ld[us]\?[bwlq]_mmu\>``
+ - ``\<helper_st[bwlq]_mmu\>``
 
 ``address_space_*``
 ~~~~~~~~~~~~~~~~~~~
diff --git a/tcg/aarch64/tcg-target.c.inc b/tcg/aarch64/tcg-target.c.inc
index 62dd22d73c..e6636c1f8b 100644
--- a/tcg/aarch64/tcg-target.c.inc
+++ b/tcg/aarch64/tcg-target.c.inc
@@ -1587,39 +1587,6 @@ typedef struct {
 } HostAddress;
 
 #ifdef CONFIG_SOFTMMU
-/* helper signature: helper_ret_ld_mmu(CPUState *env, target_ulong addr,
- *                                     MemOpIdx oi, uintptr_t ra)
- */
-static void * const qemu_ld_helpers[MO_SIZE + 1] = {
-    [MO_8]  = helper_ret_ldub_mmu,
-#if HOST_BIG_ENDIAN
-    [MO_16] = helper_be_lduw_mmu,
-    [MO_32] = helper_be_ldul_mmu,
-    [MO_64] = helper_be_ldq_mmu,
-#else
-    [MO_16] = helper_le_lduw_mmu,
-    [MO_32] = helper_le_ldul_mmu,
-    [MO_64] = helper_le_ldq_mmu,
-#endif
-};
-
-/* helper signature: helper_ret_st_mmu(CPUState *env, target_ulong addr,
- *                                     uintxx_t val, MemOpIdx oi,
- *                                     uintptr_t ra)
- */
-static void * const qemu_st_helpers[MO_SIZE + 1] = {
-    [MO_8]  = helper_ret_stb_mmu,
-#if HOST_BIG_ENDIAN
-    [MO_16] = helper_be_stw_mmu,
-    [MO_32] = helper_be_stl_mmu,
-    [MO_64] = helper_be_stq_mmu,
-#else
-    [MO_16] = helper_le_stw_mmu,
-    [MO_32] = helper_le_stl_mmu,
-    [MO_64] = helper_le_stq_mmu,
-#endif
-};
-
 static const TCGLdstHelperParam ldst_helper_param = {
     .ntmp = 1, .tmp = { TCG_REG_TMP }
 };
diff --git a/tcg/arm/tcg-target.c.inc b/tcg/arm/tcg-target.c.inc
index df514e56fc..8b0d526659 100644
--- a/tcg/arm/tcg-target.c.inc
+++ b/tcg/arm/tcg-target.c.inc
@@ -1333,43 +1333,6 @@ typedef struct {
 } HostAddress;
 
 #ifdef CONFIG_SOFTMMU
-/* helper signature: helper_ret_ld_mmu(CPUState *env, target_ulong addr,
- *                                     int mmu_idx, uintptr_t ra)
- */
-static void * const qemu_ld_helpers[MO_SSIZE + 1] = {
-    [MO_UB]   = helper_ret_ldub_mmu,
-    [MO_SB]   = helper_ret_ldsb_mmu,
-#if HOST_BIG_ENDIAN
-    [MO_UW] = helper_be_lduw_mmu,
-    [MO_UL] = helper_be_ldul_mmu,
-    [MO_UQ] = helper_be_ldq_mmu,
-    [MO_SW] = helper_be_ldsw_mmu,
-    [MO_SL] = helper_be_ldul_mmu,
-#else
-    [MO_UW] = helper_le_lduw_mmu,
-    [MO_UL] = helper_le_ldul_mmu,
-    [MO_UQ] = helper_le_ldq_mmu,
-    [MO_SW] = helper_le_ldsw_mmu,
-    [MO_SL] = helper_le_ldul_mmu,
-#endif
-};
-
-/* helper signature: helper_ret_st_mmu(CPUState *env, target_ulong addr,
- *                                     uintxx_t val, int mmu_idx, uintptr_t ra)
- */
-static void * const qemu_st_helpers[MO_SIZE + 1] = {
-    [MO_8]   = helper_ret_stb_mmu,
-#if HOST_BIG_ENDIAN
-    [MO_16] = helper_be_stw_mmu,
-    [MO_32] = helper_be_stl_mmu,
-    [MO_64] = helper_be_stq_mmu,
-#else
-    [MO_16] = helper_le_stw_mmu,
-    [MO_32] = helper_le_stl_mmu,
-    [MO_64] = helper_le_stq_mmu,
-#endif
-};
-
 static TCGReg ldst_ra_gen(TCGContext *s, const TCGLabelQemuLdst *l, int arg)
 {
     /* We arrive at the slow path via "BLNE", so R14 contains l->raddr. */
diff --git a/tcg/i386/tcg-target.c.inc b/tcg/i386/tcg-target.c.inc
index 7dbfcbd20f..bb603e7968 100644
--- a/tcg/i386/tcg-target.c.inc
+++ b/tcg/i386/tcg-target.c.inc
@@ -1776,32 +1776,6 @@ typedef struct {
 } HostAddress;
 
 #if defined(CONFIG_SOFTMMU)
-/* helper signature: helper_ret_ld_mmu(CPUState *env, target_ulong addr,
- *                                     int mmu_idx, uintptr_t ra)
- */
-static void * const qemu_ld_helpers[(MO_SIZE | MO_BSWAP) + 1] = {
-    [MO_UB]   = helper_ret_ldub_mmu,
-    [MO_LEUW] = helper_le_lduw_mmu,
-    [MO_LEUL] = helper_le_ldul_mmu,
-    [MO_LEUQ] = helper_le_ldq_mmu,
-    [MO_BEUW] = helper_be_lduw_mmu,
-    [MO_BEUL] = helper_be_ldul_mmu,
-    [MO_BEUQ] = helper_be_ldq_mmu,
-};
-
-/* helper signature: helper_ret_st_mmu(CPUState *env, target_ulong addr,
- *                                     uintxx_t val, int mmu_idx, uintptr_t ra)
- */
-static void * const qemu_st_helpers[(MO_SIZE | MO_BSWAP) + 1] = {
-    [MO_UB]   = helper_ret_stb_mmu,
-    [MO_LEUW] = helper_le_stw_mmu,
-    [MO_LEUL] = helper_le_stl_mmu,
-    [MO_LEUQ] = helper_le_stq_mmu,
-    [MO_BEUW] = helper_be_stw_mmu,
-    [MO_BEUL] = helper_be_stl_mmu,
-    [MO_BEUQ] = helper_be_stq_mmu,
-};
-
 /*
  * Because i686 has no register parameters and because x86_64 has xchg
  * to handle addr/data register overlap, we have placed all input arguments
@@ -1842,7 +1816,7 @@ static bool tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
     }
 
     tcg_out_ld_helper_args(s, l, &ldst_helper_param);
-    tcg_out_branch(s, 1, qemu_ld_helpers[opc & (MO_BSWAP | MO_SIZE)]);
+    tcg_out_branch(s, 1, qemu_ld_helpers[opc & MO_SIZE]);
     tcg_out_ld_helper_ret(s, l, false, &ldst_helper_param);
 
     tcg_out_jmp(s, l->raddr);
@@ -1864,7 +1838,7 @@ static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
     }
 
     tcg_out_st_helper_args(s, l, &ldst_helper_param);
-    tcg_out_branch(s, 1, qemu_st_helpers[opc & (MO_BSWAP | MO_SIZE)]);
+    tcg_out_branch(s, 1, qemu_st_helpers[opc & MO_SIZE]);
 
     tcg_out_jmp(s, l->raddr);
     return true;
diff --git a/tcg/loongarch64/tcg-target.c.inc b/tcg/loongarch64/tcg-target.c.inc
index 83fa45c802..d1bc29826f 100644
--- a/tcg/loongarch64/tcg-target.c.inc
+++ b/tcg/loongarch64/tcg-target.c.inc
@@ -784,29 +784,6 @@ static bool tcg_out_sti(TCGContext *s, TCGType type, TCGArg val,
  */
 
 #if defined(CONFIG_SOFTMMU)
-/*
- * helper signature: helper_ret_ld_mmu(CPUState *env, target_ulong addr,
- *                                     MemOpIdx oi, uintptr_t ra)
- */
-static void * const qemu_ld_helpers[4] = {
-    [MO_8]  = helper_ret_ldub_mmu,
-    [MO_16] = helper_le_lduw_mmu,
-    [MO_32] = helper_le_ldul_mmu,
-    [MO_64] = helper_le_ldq_mmu,
-};
-
-/*
- * helper signature: helper_ret_st_mmu(CPUState *env, target_ulong addr,
- *                                     uintxx_t val, MemOpIdx oi,
- *                                     uintptr_t ra)
- */
-static void * const qemu_st_helpers[4] = {
-    [MO_8]  = helper_ret_stb_mmu,
-    [MO_16] = helper_le_stw_mmu,
-    [MO_32] = helper_le_stl_mmu,
-    [MO_64] = helper_le_stq_mmu,
-};
-
 static bool tcg_out_goto(TCGContext *s, const tcg_insn_unit *target)
 {
     tcg_out_opc_b(s, 0);
diff --git a/tcg/mips/tcg-target.c.inc b/tcg/mips/tcg-target.c.inc
index 5ad9867882..7770ef46bd 100644
--- a/tcg/mips/tcg-target.c.inc
+++ b/tcg/mips/tcg-target.c.inc
@@ -1076,37 +1076,6 @@ static void tcg_out_call(TCGContext *s, const tcg_insn_unit *arg,
 }
 
 #if defined(CONFIG_SOFTMMU)
-static void * const qemu_ld_helpers[MO_SSIZE + 1] = {
-    [MO_UB]   = helper_ret_ldub_mmu,
-    [MO_SB]   = helper_ret_ldsb_mmu,
-#if HOST_BIG_ENDIAN
-    [MO_UW] = helper_be_lduw_mmu,
-    [MO_SW] = helper_be_ldsw_mmu,
-    [MO_UL] = helper_be_ldul_mmu,
-    [MO_SL] = helper_be_ldsl_mmu,
-    [MO_UQ] = helper_be_ldq_mmu,
-#else
-    [MO_UW] = helper_le_lduw_mmu,
-    [MO_SW] = helper_le_ldsw_mmu,
-    [MO_UL] = helper_le_ldul_mmu,
-    [MO_UQ] = helper_le_ldq_mmu,
-    [MO_SL] = helper_le_ldsl_mmu,
-#endif
-};
-
-static void * const qemu_st_helpers[MO_SIZE + 1] = {
-    [MO_UB]   = helper_ret_stb_mmu,
-#if HOST_BIG_ENDIAN
-    [MO_UW] = helper_be_stw_mmu,
-    [MO_UL] = helper_be_stl_mmu,
-    [MO_UQ] = helper_be_stq_mmu,
-#else
-    [MO_UW] = helper_le_stw_mmu,
-    [MO_UL] = helper_le_stl_mmu,
-    [MO_UQ] = helper_le_stq_mmu,
-#endif
-};
-
 /* We have four temps, we might as well expose three of them. */
 static const TCGLdstHelperParam ldst_helper_param = {
     .ntmp = 3, .tmp = { TCG_TMP0, TCG_TMP1, TCG_TMP2 }
diff --git a/tcg/ppc/tcg-target.c.inc b/tcg/ppc/tcg-target.c.inc
index 0a14c3e997..0963156a78 100644
--- a/tcg/ppc/tcg-target.c.inc
+++ b/tcg/ppc/tcg-target.c.inc
@@ -1963,32 +1963,6 @@ static const uint32_t qemu_stx_opc[(MO_SIZE + MO_BSWAP) + 1] = {
 };
 
 #if defined (CONFIG_SOFTMMU)
-/* helper signature: helper_ld_mmu(CPUState *env, target_ulong addr,
- *                                 int mmu_idx, uintptr_t ra)
- */
-static void * const qemu_ld_helpers[(MO_SIZE | MO_BSWAP) + 1] = {
-    [MO_UB]   = helper_ret_ldub_mmu,
-    [MO_LEUW] = helper_le_lduw_mmu,
-    [MO_LEUL] = helper_le_ldul_mmu,
-    [MO_LEUQ] = helper_le_ldq_mmu,
-    [MO_BEUW] = helper_be_lduw_mmu,
-    [MO_BEUL] = helper_be_ldul_mmu,
-    [MO_BEUQ] = helper_be_ldq_mmu,
-};
-
-/* helper signature: helper_st_mmu(CPUState *env, target_ulong addr,
- *                                 uintxx_t val, int mmu_idx, uintptr_t ra)
- */
-static void * const qemu_st_helpers[(MO_SIZE | MO_BSWAP) + 1] = {
-    [MO_UB]   = helper_ret_stb_mmu,
-    [MO_LEUW] = helper_le_stw_mmu,
-    [MO_LEUL] = helper_le_stl_mmu,
-    [MO_LEUQ] = helper_le_stq_mmu,
-    [MO_BEUW] = helper_be_stw_mmu,
-    [MO_BEUL] = helper_be_stl_mmu,
-    [MO_BEUQ] = helper_be_stq_mmu,
-};
-
 static TCGReg ldst_ra_gen(TCGContext *s, const TCGLabelQemuLdst *l, int arg)
 {
     if (arg < 0) {
@@ -2017,7 +1991,7 @@ static bool tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *lb)
     }
 
     tcg_out_ld_helper_args(s, lb, &ldst_helper_param);
-    tcg_out_call_int(s, LK, qemu_ld_helpers[opc & (MO_BSWAP | MO_SIZE)]);
+    tcg_out_call_int(s, LK, qemu_ld_helpers[opc & MO_SIZE]);
     tcg_out_ld_helper_ret(s, lb, false, &ldst_helper_param);
 
     tcg_out_b(s, 0, lb->raddr);
@@ -2033,7 +2007,7 @@ static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *lb)
     }
 
     tcg_out_st_helper_args(s, lb, &ldst_helper_param);
-    tcg_out_call_int(s, LK, qemu_st_helpers[opc & (MO_BSWAP | MO_SIZE)]);
+    tcg_out_call_int(s, LK, qemu_st_helpers[opc & MO_SIZE]);
 
     tcg_out_b(s, 0, lb->raddr);
     return true;
diff --git a/tcg/riscv/tcg-target.c.inc b/tcg/riscv/tcg-target.c.inc
index d12b824d8c..8ed0e2f210 100644
--- a/tcg/riscv/tcg-target.c.inc
+++ b/tcg/riscv/tcg-target.c.inc
@@ -847,48 +847,6 @@ static void tcg_out_mb(TCGContext *s, TCGArg a0)
  */
 
 #if defined(CONFIG_SOFTMMU)
-/* helper signature: helper_ret_ld_mmu(CPUState *env, target_ulong addr,
- *                                     MemOpIdx oi, uintptr_t ra)
- */
-static void * const qemu_ld_helpers[MO_SSIZE + 1] = {
-    [MO_UB] = helper_ret_ldub_mmu,
-    [MO_SB] = helper_ret_ldsb_mmu,
-#if HOST_BIG_ENDIAN
-    [MO_UW] = helper_be_lduw_mmu,
-    [MO_SW] = helper_be_ldsw_mmu,
-    [MO_UL] = helper_be_ldul_mmu,
-#if TCG_TARGET_REG_BITS == 64
-    [MO_SL] = helper_be_ldsl_mmu,
-#endif
-    [MO_UQ] = helper_be_ldq_mmu,
-#else
-    [MO_UW] = helper_le_lduw_mmu,
-    [MO_SW] = helper_le_ldsw_mmu,
-    [MO_UL] = helper_le_ldul_mmu,
-#if TCG_TARGET_REG_BITS == 64
-    [MO_SL] = helper_le_ldsl_mmu,
-#endif
-    [MO_UQ] = helper_le_ldq_mmu,
-#endif
-};
-
-/* helper signature: helper_ret_st_mmu(CPUState *env, target_ulong addr,
- *                                     uintxx_t val, MemOpIdx oi,
- *                                     uintptr_t ra)
- */
-static void * const qemu_st_helpers[MO_SIZE + 1] = {
-    [MO_8]   = helper_ret_stb_mmu,
-#if HOST_BIG_ENDIAN
-    [MO_16] = helper_be_stw_mmu,
-    [MO_32] = helper_be_stl_mmu,
-    [MO_64] = helper_be_stq_mmu,
-#else
-    [MO_16] = helper_le_stw_mmu,
-    [MO_32] = helper_le_stl_mmu,
-    [MO_64] = helper_le_stq_mmu,
-#endif
-};
-
 static void tcg_out_goto(TCGContext *s, const tcg_insn_unit *target)
 {
     tcg_out_opc_jump(s, OPC_JAL, TCG_REG_ZERO, 0);
diff --git a/tcg/s390x/tcg-target.c.inc b/tcg/s390x/tcg-target.c.inc
index aacbaf21d5..968977be98 100644
--- a/tcg/s390x/tcg-target.c.inc
+++ b/tcg/s390x/tcg-target.c.inc
@@ -438,33 +438,6 @@ static const uint8_t tcg_cond_to_ltr_cond[] = {
     [TCG_COND_GEU] = S390_CC_ALWAYS,
 };
 
-#ifdef CONFIG_SOFTMMU
-static void * const qemu_ld_helpers[(MO_SSIZE | MO_BSWAP) + 1] = {
-    [MO_UB]   = helper_ret_ldub_mmu,
-    [MO_SB]   = helper_ret_ldsb_mmu,
-    [MO_LEUW] = helper_le_lduw_mmu,
-    [MO_LESW] = helper_le_ldsw_mmu,
-    [MO_LEUL] = helper_le_ldul_mmu,
-    [MO_LESL] = helper_le_ldsl_mmu,
-    [MO_LEUQ] = helper_le_ldq_mmu,
-    [MO_BEUW] = helper_be_lduw_mmu,
-    [MO_BESW] = helper_be_ldsw_mmu,
-    [MO_BEUL] = helper_be_ldul_mmu,
-    [MO_BESL] = helper_be_ldsl_mmu,
-    [MO_BEUQ] = helper_be_ldq_mmu,
-};
-
-static void * const qemu_st_helpers[(MO_SIZE | MO_BSWAP) + 1] = {
-    [MO_UB]   = helper_ret_stb_mmu,
-    [MO_LEUW] = helper_le_stw_mmu,
-    [MO_LEUL] = helper_le_stl_mmu,
-    [MO_LEUQ] = helper_le_stq_mmu,
-    [MO_BEUW] = helper_be_stw_mmu,
-    [MO_BEUL] = helper_be_stl_mmu,
-    [MO_BEUQ] = helper_be_stq_mmu,
-};
-#endif
-
 static const tcg_insn_unit *tb_ret_addr;
 uint64_t s390_facilities[3];
 
@@ -1721,7 +1694,7 @@ static bool tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *lb)
     }
 
     tcg_out_ld_helper_args(s, lb, &ldst_helper_param);
-    tcg_out_call_int(s, qemu_ld_helpers[opc & (MO_BSWAP | MO_SIZE)]);
+    tcg_out_call_int(s, qemu_ld_helpers[opc & MO_SIZE]);
     tcg_out_ld_helper_ret(s, lb, false, &ldst_helper_param);
 
     tgen_gotoi(s, S390_CC_ALWAYS, lb->raddr);
@@ -1738,7 +1711,7 @@ static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *lb)
     }
 
     tcg_out_st_helper_args(s, lb, &ldst_helper_param);
-    tcg_out_call_int(s, qemu_st_helpers[opc & (MO_BSWAP | MO_SIZE)]);
+    tcg_out_call_int(s, qemu_st_helpers[opc & MO_SIZE]);
 
     tgen_gotoi(s, S390_CC_ALWAYS, lb->raddr);
     return true;
diff --git a/tcg/sparc64/tcg-target.c.inc b/tcg/sparc64/tcg-target.c.inc
index 7e6466d3b6..e997db2645 100644
--- a/tcg/sparc64/tcg-target.c.inc
+++ b/tcg/sparc64/tcg-target.c.inc
@@ -919,33 +919,11 @@ static void tcg_out_mb(TCGContext *s, TCGArg a0)
 }
 
 #ifdef CONFIG_SOFTMMU
-static const tcg_insn_unit *qemu_ld_trampoline[(MO_SSIZE | MO_BSWAP) + 1];
-static const tcg_insn_unit *qemu_st_trampoline[(MO_SIZE | MO_BSWAP) + 1];
+static const tcg_insn_unit *qemu_ld_trampoline[MO_SSIZE + 1];
+static const tcg_insn_unit *qemu_st_trampoline[MO_SIZE + 1];
 
 static void build_trampolines(TCGContext *s)
 {
-    static void * const qemu_ld_helpers[] = {
-        [MO_UB]   = helper_ret_ldub_mmu,
-        [MO_SB]   = helper_ret_ldsb_mmu,
-        [MO_LEUW] = helper_le_lduw_mmu,
-        [MO_LESW] = helper_le_ldsw_mmu,
-        [MO_LEUL] = helper_le_ldul_mmu,
-        [MO_LEUQ] = helper_le_ldq_mmu,
-        [MO_BEUW] = helper_be_lduw_mmu,
-        [MO_BESW] = helper_be_ldsw_mmu,
-        [MO_BEUL] = helper_be_ldul_mmu,
-        [MO_BEUQ] = helper_be_ldq_mmu,
-    };
-    static void * const qemu_st_helpers[] = {
-        [MO_UB]   = helper_ret_stb_mmu,
-        [MO_LEUW] = helper_le_stw_mmu,
-        [MO_LEUL] = helper_le_stl_mmu,
-        [MO_LEUQ] = helper_le_stq_mmu,
-        [MO_BEUW] = helper_be_stw_mmu,
-        [MO_BEUL] = helper_be_stl_mmu,
-        [MO_BEUQ] = helper_be_stq_mmu,
-    };
-
     int i;
 
     for (i = 0; i < ARRAY_SIZE(qemu_ld_helpers); ++i) {
@@ -1210,9 +1188,9 @@ static void tcg_out_qemu_ld(TCGContext *s, TCGReg data, TCGReg addr,
     /* We use the helpers to extend SB and SW data, leaving the case
        of SL needing explicit extending below.  */
     if ((memop & MO_SSIZE) == MO_SL) {
-        func = qemu_ld_trampoline[memop & (MO_BSWAP | MO_SIZE)];
+        func = qemu_ld_trampoline[MO_UL];
     } else {
-        func = qemu_ld_trampoline[memop & (MO_BSWAP | MO_SSIZE)];
+        func = qemu_ld_trampoline[memop & MO_SSIZE];
     }
     tcg_debug_assert(func != NULL);
     tcg_out_call_nodelay(s, func, false);
@@ -1353,7 +1331,7 @@ static void tcg_out_qemu_st(TCGContext *s, TCGReg data, TCGReg addr,
     tcg_out_movext(s, (memop & MO_SIZE) == MO_64 ? TCG_TYPE_I64 : TCG_TYPE_I32,
                    TCG_REG_O2, data_type, memop & MO_SIZE, data);
 
-    func = qemu_st_trampoline[memop & (MO_BSWAP | MO_SIZE)];
+    func = qemu_st_trampoline[memop & MO_SIZE];
     tcg_debug_assert(func != NULL);
     tcg_out_call_nodelay(s, func, false);
     /* delay slot */
-- 
2.34.1




* [PATCH v4 10/57] accel/tcg: Implement helper_{ld, st}*_mmu for user-only
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
                   ` (8 preceding siblings ...)
  2023-05-03  7:06 ` [PATCH v4 09/57] tcg: Unify helper_{be,le}_{ld,st}* Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-05  9:43   ` Peter Maydell
  2023-05-03  7:06 ` [PATCH v4 11/57] tcg/tci: Use helper_{ld,st}*_mmu " Richard Henderson
                   ` (47 subsequent siblings)
  57 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel; +Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

TCG backends may need to defer to a helper to implement
the atomicity required by a given operation.  Mirror the
interface used in system mode.
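
As a minimal standalone sketch (simplified stand-in types and values,
not QEMU's) of the worker/wrapper split used below: a host-endian
worker performs the access, and a thin wrapper applies the byte swap
requested by the memory op, as do_ld2_he_mmu and helper_lduw_mmu do
in the diff.

    /* Standalone sketch, not QEMU code. */
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define MO_BSWAP 8        /* stand-in for MemOp's byte-swap bit */

    static uint16_t swap16(uint16_t x)
    {
        return (uint16_t)((x << 8) | (x >> 8));
    }

    /* Worker: load 2 bytes host-endian from a host address. */
    static uint16_t do_ld2_he(const void *haddr)
    {
        uint16_t ret;
        memcpy(&ret, haddr, sizeof(ret));
        return ret;
    }

    /* Wrapper: swap only when the op asks for the other endianness. */
    static uint16_t helper_lduw(const void *haddr, int mop)
    {
        uint16_t ret = do_ld2_he(haddr);
        if (mop & MO_BSWAP) {
            ret = swap16(ret);
        }
        return ret;
    }

    int main(void)
    {
        uint8_t buf[2] = { 0x12, 0x34 };
        printf("host-endian: %#x  swapped: %#x\n",
               helper_lduw(buf, 0), helper_lduw(buf, MO_BSWAP));
        return 0;
    }

The same split lets the signed and unsigned entry points of each size
share one worker, with only the swap or sign extension differing in
the wrapper.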

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 include/tcg/tcg-ldst.h |   6 +-
 accel/tcg/user-exec.c  | 393 ++++++++++++++++++++++++++++-------------
 tcg/tcg.c              |   6 +-
 3 files changed, 278 insertions(+), 127 deletions(-)

diff --git a/include/tcg/tcg-ldst.h b/include/tcg/tcg-ldst.h
index 3d897ca942..57fafa14b1 100644
--- a/include/tcg/tcg-ldst.h
+++ b/include/tcg/tcg-ldst.h
@@ -25,8 +25,6 @@
 #ifndef TCG_LDST_H
 #define TCG_LDST_H
 
-#ifdef CONFIG_SOFTMMU
-
 /* Value zero-extended to tcg register size.  */
 tcg_target_ulong helper_ldub_mmu(CPUArchState *env, target_ulong addr,
                                  MemOpIdx oi, uintptr_t retaddr);
@@ -58,10 +56,10 @@ void helper_stl_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
 void helper_stq_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
                     MemOpIdx oi, uintptr_t retaddr);
 
-#else
+#ifdef CONFIG_USER_ONLY
 
 G_NORETURN void helper_unaligned_ld(CPUArchState *env, target_ulong addr);
 G_NORETURN void helper_unaligned_st(CPUArchState *env, target_ulong addr);
 
-#endif /* CONFIG_SOFTMMU */
+#endif /* CONFIG_USER_ONLY */
 #endif /* TCG_LDST_H */
diff --git a/accel/tcg/user-exec.c b/accel/tcg/user-exec.c
index b89fa35a83..d9f9766b7f 100644
--- a/accel/tcg/user-exec.c
+++ b/accel/tcg/user-exec.c
@@ -889,21 +889,6 @@ void page_reset_target_data(target_ulong start, target_ulong last) { }
 
 /* The softmmu versions of these helpers are in cputlb.c.  */
 
-/*
- * Verify that we have passed the correct MemOp to the correct function.
- *
- * We could present one function to target code, and dispatch based on
- * the MemOp, but so far we have worked hard to avoid an indirect function
- * call along the memory path.
- */
-static void validate_memop(MemOpIdx oi, MemOp expected)
-{
-#ifdef CONFIG_DEBUG_TCG
-    MemOp have = get_memop(oi) & (MO_SIZE | MO_BSWAP);
-    assert(have == expected);
-#endif
-}
-
 void helper_unaligned_ld(CPUArchState *env, target_ulong addr)
 {
     cpu_loop_exit_sigbus(env_cpu(env), addr, MMU_DATA_LOAD, GETPC());
@@ -914,10 +899,9 @@ void helper_unaligned_st(CPUArchState *env, target_ulong addr)
     cpu_loop_exit_sigbus(env_cpu(env), addr, MMU_DATA_STORE, GETPC());
 }
 
-static void *cpu_mmu_lookup(CPUArchState *env, target_ulong addr,
-                            MemOpIdx oi, uintptr_t ra, MMUAccessType type)
+static void *cpu_mmu_lookup(CPUArchState *env, abi_ptr addr,
+                            MemOp mop, uintptr_t ra, MMUAccessType type)
 {
-    MemOp mop = get_memop(oi);
     int a_bits = get_alignment_bits(mop);
     void *ret;
 
@@ -933,100 +917,206 @@ static void *cpu_mmu_lookup(CPUArchState *env, target_ulong addr,
 
 #include "ldst_atomicity.c.inc"
 
-uint8_t cpu_ldb_mmu(CPUArchState *env, abi_ptr addr,
-                    MemOpIdx oi, uintptr_t ra)
+static uint8_t do_ld1_mmu(CPUArchState *env, abi_ptr addr,
+                          MemOp mop, uintptr_t ra)
 {
     void *haddr;
     uint8_t ret;
 
-    validate_memop(oi, MO_UB);
-    haddr = cpu_mmu_lookup(env, addr, oi, ra, MMU_DATA_LOAD);
+    tcg_debug_assert((mop & MO_SIZE) == MO_8);
+    haddr = cpu_mmu_lookup(env, addr, mop, ra, MMU_DATA_LOAD);
     ret = ldub_p(haddr);
     clear_helper_retaddr();
+    return ret;
+}
+
+tcg_target_ulong helper_ldub_mmu(CPUArchState *env, target_ulong addr,
+                                 MemOpIdx oi, uintptr_t ra)
+{
+    return do_ld1_mmu(env, addr, get_memop(oi), ra);
+}
+
+tcg_target_ulong helper_ldsb_mmu(CPUArchState *env, target_ulong addr,
+                                 MemOpIdx oi, uintptr_t ra)
+{
+    return (int8_t)do_ld1_mmu(env, addr, get_memop(oi), ra);
+}
+
+uint8_t cpu_ldb_mmu(CPUArchState *env, abi_ptr addr,
+                    MemOpIdx oi, uintptr_t ra)
+{
+    uint8_t ret = do_ld1_mmu(env, addr, get_memop(oi), ra);
     qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_R);
     return ret;
 }
 
+static uint16_t do_ld2_he_mmu(CPUArchState *env, abi_ptr addr,
+                              MemOp mop, uintptr_t ra)
+{
+    void *haddr;
+    uint16_t ret;
+
+    tcg_debug_assert((mop & MO_SIZE) == MO_16);
+    haddr = cpu_mmu_lookup(env, addr, mop, ra, MMU_DATA_LOAD);
+    ret = load_atom_2(env, ra, haddr, mop);
+    clear_helper_retaddr();
+    return ret;
+}
+
+tcg_target_ulong helper_lduw_mmu(CPUArchState *env, target_ulong addr,
+                                 MemOpIdx oi, uintptr_t ra)
+{
+    MemOp mop = get_memop(oi);
+    uint16_t ret = do_ld2_he_mmu(env, addr, mop, ra);
+
+    if (mop & MO_BSWAP) {
+        ret = bswap16(ret);
+    }
+    return ret;
+}
+
+tcg_target_ulong helper_ldsw_mmu(CPUArchState *env, target_ulong addr,
+                                 MemOpIdx oi, uintptr_t ra)
+{
+    MemOp mop = get_memop(oi);
+    int16_t ret = do_ld2_he_mmu(env, addr, mop, ra);
+
+    if (mop & MO_BSWAP) {
+        ret = bswap16(ret);
+    }
+    return ret;
+}
+
 uint16_t cpu_ldw_be_mmu(CPUArchState *env, abi_ptr addr,
                         MemOpIdx oi, uintptr_t ra)
 {
-    void *haddr;
+    MemOp mop = get_memop(oi);
     uint16_t ret;
 
-    validate_memop(oi, MO_BEUW);
-    haddr = cpu_mmu_lookup(env, addr, oi, ra, MMU_DATA_LOAD);
-    ret = load_atom_2(env, ra, haddr, get_memop(oi));
-    clear_helper_retaddr();
+    tcg_debug_assert((mop & MO_BSWAP) == MO_BE);
+    ret = do_ld2_he_mmu(env, addr, mop, ra);
     qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_R);
     return cpu_to_be16(ret);
 }
 
-uint32_t cpu_ldl_be_mmu(CPUArchState *env, abi_ptr addr,
-                        MemOpIdx oi, uintptr_t ra)
-{
-    void *haddr;
-    uint32_t ret;
-
-    validate_memop(oi, MO_BEUL);
-    haddr = cpu_mmu_lookup(env, addr, oi, ra, MMU_DATA_LOAD);
-    ret = load_atom_4(env, ra, haddr, get_memop(oi));
-    clear_helper_retaddr();
-    qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_R);
-    return cpu_to_be32(ret);
-}
-
-uint64_t cpu_ldq_be_mmu(CPUArchState *env, abi_ptr addr,
-                        MemOpIdx oi, uintptr_t ra)
-{
-    void *haddr;
-    uint64_t ret;
-
-    validate_memop(oi, MO_BEUQ);
-    haddr = cpu_mmu_lookup(env, addr, oi, ra, MMU_DATA_LOAD);
-    ret = load_atom_8(env, ra, haddr, get_memop(oi));
-    clear_helper_retaddr();
-    qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_R);
-    return cpu_to_be64(ret);
-}
-
 uint16_t cpu_ldw_le_mmu(CPUArchState *env, abi_ptr addr,
                         MemOpIdx oi, uintptr_t ra)
 {
-    void *haddr;
+    MemOp mop = get_memop(oi);
     uint16_t ret;
 
-    validate_memop(oi, MO_LEUW);
-    haddr = cpu_mmu_lookup(env, addr, oi, ra, MMU_DATA_LOAD);
-    ret = load_atom_2(env, ra, haddr, get_memop(oi));
-    clear_helper_retaddr();
+    tcg_debug_assert((mop & MO_BSWAP) == MO_LE);
+    ret = do_ld2_he_mmu(env, addr, mop, ra);
     qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_R);
     return cpu_to_le16(ret);
 }
 
+static uint32_t do_ld4_he_mmu(CPUArchState *env, abi_ptr addr,
+                              MemOp mop, uintptr_t ra)
+{
+    void *haddr;
+    uint32_t ret;
+
+    tcg_debug_assert((mop & MO_SIZE) == MO_32);
+    haddr = cpu_mmu_lookup(env, addr, mop, ra, MMU_DATA_LOAD);
+    ret = load_atom_4(env, ra, haddr, mop);
+    clear_helper_retaddr();
+    return ret;
+}
+
+tcg_target_ulong helper_ldul_mmu(CPUArchState *env, target_ulong addr,
+                                 MemOpIdx oi, uintptr_t ra)
+{
+    MemOp mop = get_memop(oi);
+    uint32_t ret = do_ld4_he_mmu(env, addr, mop, ra);
+
+    if (mop & MO_BSWAP) {
+        ret = bswap32(ret);
+    }
+    return ret;
+}
+
+tcg_target_ulong helper_ldsl_mmu(CPUArchState *env, target_ulong addr,
+                                 MemOpIdx oi, uintptr_t ra)
+{
+    MemOp mop = get_memop(oi);
+    int32_t ret = do_ld4_he_mmu(env, addr, mop, ra);
+
+    if (mop & MO_BSWAP) {
+        ret = bswap32(ret);
+    }
+    return ret;
+}
+
+uint32_t cpu_ldl_be_mmu(CPUArchState *env, abi_ptr addr,
+                        MemOpIdx oi, uintptr_t ra)
+{
+    MemOp mop = get_memop(oi);
+    uint32_t ret;
+
+    tcg_debug_assert((mop & MO_BSWAP) == MO_BE);
+    ret = do_ld4_he_mmu(env, addr, mop, ra);
+    qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_R);
+    return cpu_to_be32(ret);
+}
+
 uint32_t cpu_ldl_le_mmu(CPUArchState *env, abi_ptr addr,
                         MemOpIdx oi, uintptr_t ra)
 {
-    void *haddr;
+    MemOp mop = get_memop(oi);
     uint32_t ret;
 
-    validate_memop(oi, MO_LEUL);
-    haddr = cpu_mmu_lookup(env, addr, oi, ra, MMU_DATA_LOAD);
-    ret = load_atom_4(env, ra, haddr, get_memop(oi));
-    clear_helper_retaddr();
+    tcg_debug_assert((mop & MO_BSWAP) == MO_LE);
+    ret = do_ld4_he_mmu(env, addr, mop, ra);
     qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_R);
     return cpu_to_le32(ret);
 }
 
+static uint64_t do_ld8_he_mmu(CPUArchState *env, abi_ptr addr,
+                              MemOp mop, uintptr_t ra)
+{
+    void *haddr;
+    uint64_t ret;
+
+    tcg_debug_assert((mop & MO_SIZE) == MO_64);
+    haddr = cpu_mmu_lookup(env, addr, mop, ra, MMU_DATA_LOAD);
+    ret = load_atom_8(env, ra, haddr, mop);
+    clear_helper_retaddr();
+    return ret;
+}
+
+uint64_t helper_ldq_mmu(CPUArchState *env, target_ulong addr,
+                        MemOpIdx oi, uintptr_t ra)
+{
+    MemOp mop = get_memop(oi);
+    uint64_t ret = do_ld8_he_mmu(env, addr, mop, ra);
+
+    if (mop & MO_BSWAP) {
+        ret = bswap64(ret);
+    }
+    return ret;
+}
+
+uint64_t cpu_ldq_be_mmu(CPUArchState *env, abi_ptr addr,
+                        MemOpIdx oi, uintptr_t ra)
+{
+    MemOp mop = get_memop(oi);
+    uint64_t ret;
+
+    tcg_debug_assert((mop & MO_BSWAP) == MO_BE);
+    ret = do_ld8_he_mmu(env, addr, mop, ra);
+    qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_R);
+    return cpu_to_be64(ret);
+}
+
 uint64_t cpu_ldq_le_mmu(CPUArchState *env, abi_ptr addr,
                         MemOpIdx oi, uintptr_t ra)
 {
-    void *haddr;
+    MemOp mop = get_memop(oi);
     uint64_t ret;
 
-    validate_memop(oi, MO_LEUQ);
-    haddr = cpu_mmu_lookup(env, addr, oi, ra, MMU_DATA_LOAD);
-    ret = load_atom_8(env, ra, haddr, get_memop(oi));
-    clear_helper_retaddr();
+    tcg_debug_assert((mop & MO_BSWAP) == MO_LE);
+    ret = do_ld8_he_mmu(env, addr, mop, ra);
     qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_R);
     return cpu_to_le64(ret);
 }
@@ -1037,7 +1127,7 @@ Int128 cpu_ld16_be_mmu(CPUArchState *env, abi_ptr addr,
     void *haddr;
     Int128 ret;
 
-    validate_memop(oi, MO_128 | MO_BE);
+    tcg_debug_assert((get_memop(oi) & (MO_BSWAP | MO_SIZE)) == (MO_128 | MO_BE));
     haddr = cpu_mmu_lookup(env, addr, oi, ra, MMU_DATA_LOAD);
     memcpy(&ret, haddr, 16);
     clear_helper_retaddr();
@@ -1055,7 +1145,7 @@ Int128 cpu_ld16_le_mmu(CPUArchState *env, abi_ptr addr,
     void *haddr;
     Int128 ret;
 
-    validate_memop(oi, MO_128 | MO_LE);
+    tcg_debug_assert((get_memop(oi) & (MO_BSWAP | MO_SIZE)) == (MO_128 | MO_LE));
     haddr = cpu_mmu_lookup(env, addr, oi, ra, MMU_DATA_LOAD);
     memcpy(&ret, haddr, 16);
     clear_helper_retaddr();
@@ -1067,87 +1157,153 @@ Int128 cpu_ld16_le_mmu(CPUArchState *env, abi_ptr addr,
     return ret;
 }
 
-void cpu_stb_mmu(CPUArchState *env, abi_ptr addr, uint8_t val,
-                 MemOpIdx oi, uintptr_t ra)
+static void do_st1_mmu(CPUArchState *env, abi_ptr addr, uint8_t val,
+                       MemOp mop, uintptr_t ra)
 {
     void *haddr;
 
-    validate_memop(oi, MO_UB);
-    haddr = cpu_mmu_lookup(env, addr, oi, ra, MMU_DATA_STORE);
+    tcg_debug_assert((mop & MO_SIZE) == MO_8);
+    haddr = cpu_mmu_lookup(env, addr, mop, ra, MMU_DATA_STORE);
     stb_p(haddr, val);
     clear_helper_retaddr();
+}
+
+void helper_stb_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
+                    MemOpIdx oi, uintptr_t ra)
+{
+    do_st1_mmu(env, addr, val, get_memop(oi), ra);
+}
+
+void cpu_stb_mmu(CPUArchState *env, abi_ptr addr, uint8_t val,
+                 MemOpIdx oi, uintptr_t ra)
+{
+    do_st1_mmu(env, addr, val, get_memop(oi), ra);
     qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_W);
 }
 
+static void do_st2_he_mmu(CPUArchState *env, abi_ptr addr, uint16_t val,
+                          MemOp mop, uintptr_t ra)
+{
+    void *haddr;
+
+    tcg_debug_assert((mop & MO_SIZE) == MO_16);
+    haddr = cpu_mmu_lookup(env, addr, mop, ra, MMU_DATA_STORE);
+    store_atom_2(env, ra, haddr, mop, val);
+    clear_helper_retaddr();
+}
+
+void helper_stw_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
+                    MemOpIdx oi, uintptr_t ra)
+{
+    MemOp mop = get_memop(oi);
+
+    if (mop & MO_BSWAP) {
+        val = bswap16(val);
+    }
+    do_st2_he_mmu(env, addr, val, mop, ra);
+}
+
 void cpu_stw_be_mmu(CPUArchState *env, abi_ptr addr, uint16_t val,
                     MemOpIdx oi, uintptr_t ra)
 {
-    void *haddr;
+    MemOp mop = get_memop(oi);
 
-    validate_memop(oi, MO_BEUW);
-    haddr = cpu_mmu_lookup(env, addr, oi, ra, MMU_DATA_STORE);
-    store_atom_2(env, ra, haddr, get_memop(oi), be16_to_cpu(val));
-    clear_helper_retaddr();
-    qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_W);
-}
-
-void cpu_stl_be_mmu(CPUArchState *env, abi_ptr addr, uint32_t val,
-                    MemOpIdx oi, uintptr_t ra)
-{
-    void *haddr;
-
-    validate_memop(oi, MO_BEUL);
-    haddr = cpu_mmu_lookup(env, addr, oi, ra, MMU_DATA_STORE);
-    store_atom_4(env, ra, haddr, get_memop(oi), be32_to_cpu(val));
-    clear_helper_retaddr();
-    qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_W);
-}
-
-void cpu_stq_be_mmu(CPUArchState *env, abi_ptr addr, uint64_t val,
-                    MemOpIdx oi, uintptr_t ra)
-{
-    void *haddr;
-
-    validate_memop(oi, MO_BEUQ);
-    haddr = cpu_mmu_lookup(env, addr, oi, ra, MMU_DATA_STORE);
-    store_atom_8(env, ra, haddr, get_memop(oi), be64_to_cpu(val));
-    clear_helper_retaddr();
+    tcg_debug_assert((mop & MO_BSWAP) == MO_BE);
+    do_st2_he_mmu(env, addr, be16_to_cpu(val), mop, ra);
     qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_W);
 }
 
 void cpu_stw_le_mmu(CPUArchState *env, abi_ptr addr, uint16_t val,
                     MemOpIdx oi, uintptr_t ra)
+{
+    MemOp mop = get_memop(oi);
+
+    tcg_debug_assert((mop & MO_BSWAP) == MO_LE);
+    do_st2_he_mmu(env, addr, le16_to_cpu(val), mop, ra);
+    qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_W);
+}
+
+static void do_st4_he_mmu(CPUArchState *env, abi_ptr addr, uint32_t val,
+                          MemOp mop, uintptr_t ra)
 {
     void *haddr;
 
-    validate_memop(oi, MO_LEUW);
-    haddr = cpu_mmu_lookup(env, addr, oi, ra, MMU_DATA_STORE);
-    store_atom_2(env, ra, haddr, get_memop(oi), le16_to_cpu(val));
+    tcg_debug_assert((mop & MO_SIZE) == MO_32);
+    haddr = cpu_mmu_lookup(env, addr, mop, ra, MMU_DATA_STORE);
+    store_atom_4(env, ra, haddr, mop, val);
     clear_helper_retaddr();
+}
+
+void helper_stl_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
+                    MemOpIdx oi, uintptr_t ra)
+{
+    MemOp mop = get_memop(oi);
+
+    if (mop & MO_BSWAP) {
+        val = bswap32(val);
+    }
+    do_st4_he_mmu(env, addr, val, mop, ra);
+}
+
+void cpu_stl_be_mmu(CPUArchState *env, abi_ptr addr, uint32_t val,
+                    MemOpIdx oi, uintptr_t ra)
+{
+    MemOp mop = get_memop(oi);
+
+    tcg_debug_assert((mop & MO_BSWAP) == MO_BE);
+    do_st4_he_mmu(env, addr, be32_to_cpu(val), mop, ra);
     qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_W);
 }
 
 void cpu_stl_le_mmu(CPUArchState *env, abi_ptr addr, uint32_t val,
                     MemOpIdx oi, uintptr_t ra)
+{
+    MemOp mop = get_memop(oi);
+
+    tcg_debug_assert((mop & MO_BSWAP) == MO_LE);
+    do_st4_he_mmu(env, addr, le32_to_cpu(val), mop, ra);
+    qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_W);
+}
+
+static void do_st8_he_mmu(CPUArchState *env, abi_ptr addr, uint64_t val,
+                          MemOp mop, uintptr_t ra)
 {
     void *haddr;
 
-    validate_memop(oi, MO_LEUL);
-    haddr = cpu_mmu_lookup(env, addr, oi, ra, MMU_DATA_STORE);
-    store_atom_4(env, ra, haddr, get_memop(oi), le32_to_cpu(val));
+    tcg_debug_assert((mop & MO_SIZE) == MO_64);
+    haddr = cpu_mmu_lookup(env, addr, mop, ra, MMU_DATA_STORE);
+    store_atom_8(env, ra, haddr, mop, val);
     clear_helper_retaddr();
+}
+
+void helper_stq_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
+                    MemOpIdx oi, uintptr_t ra)
+{
+    MemOp mop = get_memop(oi);
+
+    if (mop & MO_BSWAP) {
+        val = bswap64(val);
+    }
+    do_st8_he_mmu(env, addr, val, mop, ra);
+}
+
+void cpu_stq_be_mmu(CPUArchState *env, abi_ptr addr, uint64_t val,
+                    MemOpIdx oi, uintptr_t ra)
+{
+    MemOp mop = get_memop(oi);
+
+    tcg_debug_assert((mop & MO_BSWAP) == MO_BE);
+    do_st8_he_mmu(env, addr, cpu_to_be64(val), mop, ra);
     qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_W);
 }
 
 void cpu_stq_le_mmu(CPUArchState *env, abi_ptr addr, uint64_t val,
                     MemOpIdx oi, uintptr_t ra)
 {
-    void *haddr;
+    MemOp mop = get_memop(oi);
 
-    validate_memop(oi, MO_LEUQ);
-    haddr = cpu_mmu_lookup(env, addr, oi, ra, MMU_DATA_STORE);
-    store_atom_8(env, ra, haddr, get_memop(oi), le64_to_cpu(val));
-    clear_helper_retaddr();
+    tcg_debug_assert((mop & MO_BSWAP) == MO_LE);
+    do_st8_he_mmu(env, addr, cpu_to_le64(val), mop, ra);
     qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_W);
 }
 
@@ -1156,7 +1312,7 @@ void cpu_st16_be_mmu(CPUArchState *env, abi_ptr addr,
 {
     void *haddr;
 
-    validate_memop(oi, MO_128 | MO_BE);
+    tcg_debug_assert((get_memop(oi) & (MO_BSWAP | MO_SIZE)) == (MO_128 | MO_BE));
     haddr = cpu_mmu_lookup(env, addr, oi, ra, MMU_DATA_STORE);
     if (!HOST_BIG_ENDIAN) {
         val = bswap128(val);
@@ -1171,7 +1327,7 @@ void cpu_st16_le_mmu(CPUArchState *env, abi_ptr addr,
 {
     void *haddr;
 
-    validate_memop(oi, MO_128 | MO_LE);
+    tcg_debug_assert((get_memop(oi) & (MO_BSWAP | MO_SIZE)) == (MO_128 | MO_LE));
     haddr = cpu_mmu_lookup(env, addr, oi, ra, MMU_DATA_STORE);
     if (HOST_BIG_ENDIAN) {
         val = bswap128(val);
@@ -1269,7 +1425,6 @@ uint64_t cpu_ldq_code_mmu(CPUArchState *env, abi_ptr addr,
     void *haddr;
     uint64_t ret;
 
-    validate_memop(oi, MO_BEUQ);
     haddr = cpu_mmu_lookup(env, addr, oi, ra, MMU_DATA_LOAD);
     ret = ldq_p(haddr);
     clear_helper_retaddr();
diff --git a/tcg/tcg.c b/tcg/tcg.c
index 12510c78c6..d0afabf194 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -197,8 +197,7 @@ static void tcg_out_st_helper_args(TCGContext *s, const TCGLabelQemuLdst *l,
                                    const TCGLdstHelperParam *p)
     __attribute__((unused));
 
-#ifdef CONFIG_SOFTMMU
-static void * const qemu_ld_helpers[MO_SSIZE + 1] = {
+static void * const qemu_ld_helpers[MO_SSIZE + 1] __attribute__((unused)) = {
     [MO_UB] = helper_ldub_mmu,
     [MO_SB] = helper_ldsb_mmu,
     [MO_UW] = helper_lduw_mmu,
@@ -210,13 +209,12 @@ static void * const qemu_ld_helpers[MO_SSIZE + 1] = {
 #endif
 };
 
-static void * const qemu_st_helpers[MO_SIZE + 1] = {
+static void * const qemu_st_helpers[MO_SIZE + 1] __attribute__((unused)) = {
     [MO_8]  = helper_stb_mmu,
     [MO_16] = helper_stw_mmu,
     [MO_32] = helper_stl_mmu,
     [MO_64] = helper_stq_mmu,
 };
-#endif
 
 TCGContext tcg_init_ctx;
 __thread TCGContext *tcg_ctx;
-- 
2.34.1




* [PATCH v4 11/57] tcg/tci: Use helper_{ld,st}*_mmu for user-only
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
                   ` (9 preceding siblings ...)
  2023-05-03  7:06 ` [PATCH v4 10/57] accel/tcg: Implement helper_{ld, st}*_mmu for user-only Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-05  9:44   ` Peter Maydell
  2023-05-03  7:06 ` [PATCH v4 12/57] tcg: Add 128-bit guest memory primitives Richard Henderson
                   ` (46 subsequent siblings)
  57 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel; +Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

We can now fold the separate softmmu and user-only code paths into
one, since helper_{ld,st}*_mmu is provided for user-only as well.
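
As a minimal standalone sketch (names and constant values below are
stand-ins, not QEMU's): once the same helper entry points exist for
both softmmu and user-only, the interpreter's load path reduces to a
single size dispatch with no #ifdef, which is the shape the diff
below leaves behind.

    /* Standalone sketch, not QEMU code: stand-ins for the unified
     * helper_ld*_mmu entry points. */
    #include <stdint.h>
    #include <string.h>

    enum { MO_8 = 0, MO_16 = 1, MO_32 = 2, MO_64 = 3, MO_SIZE = 3 };

    static uint64_t helper_ldub(const uint8_t *p)
    {
        return *p;
    }

    static uint64_t helper_lduw(const uint8_t *p)
    {
        uint16_t v;
        memcpy(&v, p, sizeof(v));
        return v;
    }

    static uint64_t helper_ldul(const uint8_t *p)
    {
        uint32_t v;
        memcpy(&v, p, sizeof(v));
        return v;
    }

    static uint64_t helper_ldq(const uint8_t *p)
    {
        uint64_t v;
        memcpy(&v, p, sizeof(v));
        return v;
    }

    /* One dispatch for all configurations, no user-only special case. */
    static uint64_t tci_ld(const uint8_t *p, int mop)
    {
        switch (mop & MO_SIZE) {
        case MO_8:  return helper_ldub(p);
        case MO_16: return helper_lduw(p);
        case MO_32: return helper_ldul(p);
        case MO_64: return helper_ldq(p);
        default:    return 0;   /* unreachable for the sizes above */
        }
    }

    int main(void)
    {
        uint8_t buf[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
        return tci_ld(buf, MO_8) == 1 ? 0 : 1;
    }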

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/tci.c | 89 -------------------------------------------------------
 1 file changed, 89 deletions(-)

diff --git a/tcg/tci.c b/tcg/tci.c
index 5bde2e1f2e..15f2f8c463 100644
--- a/tcg/tci.c
+++ b/tcg/tci.c
@@ -292,7 +292,6 @@ static uint64_t tci_qemu_ld(CPUArchState *env, target_ulong taddr,
     MemOp mop = get_memop(oi);
     uintptr_t ra = (uintptr_t)tb_ptr;
 
-#ifdef CONFIG_SOFTMMU
     switch (mop & MO_SSIZE) {
     case MO_UB:
         return helper_ldub_mmu(env, taddr, oi, ra);
@@ -311,58 +310,6 @@ static uint64_t tci_qemu_ld(CPUArchState *env, target_ulong taddr,
     default:
         g_assert_not_reached();
     }
-#else
-    void *haddr = g2h(env_cpu(env), taddr);
-    unsigned a_mask = (1u << get_alignment_bits(mop)) - 1;
-    uint64_t ret;
-
-    set_helper_retaddr(ra);
-    if (taddr & a_mask) {
-        helper_unaligned_ld(env, taddr);
-    }
-    switch (mop & (MO_BSWAP | MO_SSIZE)) {
-    case MO_UB:
-        ret = ldub_p(haddr);
-        break;
-    case MO_SB:
-        ret = ldsb_p(haddr);
-        break;
-    case MO_LEUW:
-        ret = lduw_le_p(haddr);
-        break;
-    case MO_LESW:
-        ret = ldsw_le_p(haddr);
-        break;
-    case MO_LEUL:
-        ret = (uint32_t)ldl_le_p(haddr);
-        break;
-    case MO_LESL:
-        ret = (int32_t)ldl_le_p(haddr);
-        break;
-    case MO_LEUQ:
-        ret = ldq_le_p(haddr);
-        break;
-    case MO_BEUW:
-        ret = lduw_be_p(haddr);
-        break;
-    case MO_BESW:
-        ret = ldsw_be_p(haddr);
-        break;
-    case MO_BEUL:
-        ret = (uint32_t)ldl_be_p(haddr);
-        break;
-    case MO_BESL:
-        ret = (int32_t)ldl_be_p(haddr);
-        break;
-    case MO_BEUQ:
-        ret = ldq_be_p(haddr);
-        break;
-    default:
-        g_assert_not_reached();
-    }
-    clear_helper_retaddr();
-    return ret;
-#endif
 }
 
 static void tci_qemu_st(CPUArchState *env, target_ulong taddr, uint64_t val,
@@ -371,7 +318,6 @@ static void tci_qemu_st(CPUArchState *env, target_ulong taddr, uint64_t val,
     MemOp mop = get_memop(oi);
     uintptr_t ra = (uintptr_t)tb_ptr;
 
-#ifdef CONFIG_SOFTMMU
     switch (mop & MO_SIZE) {
     case MO_UB:
         helper_stb_mmu(env, taddr, val, oi, ra);
@@ -388,41 +334,6 @@ static void tci_qemu_st(CPUArchState *env, target_ulong taddr, uint64_t val,
     default:
         g_assert_not_reached();
     }
-#else
-    void *haddr = g2h(env_cpu(env), taddr);
-    unsigned a_mask = (1u << get_alignment_bits(mop)) - 1;
-
-    set_helper_retaddr(ra);
-    if (taddr & a_mask) {
-        helper_unaligned_st(env, taddr);
-    }
-    switch (mop & (MO_BSWAP | MO_SIZE)) {
-    case MO_UB:
-        stb_p(haddr, val);
-        break;
-    case MO_LEUW:
-        stw_le_p(haddr, val);
-        break;
-    case MO_LEUL:
-        stl_le_p(haddr, val);
-        break;
-    case MO_LEUQ:
-        stq_le_p(haddr, val);
-        break;
-    case MO_BEUW:
-        stw_be_p(haddr, val);
-        break;
-    case MO_BEUL:
-        stl_be_p(haddr, val);
-        break;
-    case MO_BEUQ:
-        stq_be_p(haddr, val);
-        break;
-    default:
-        g_assert_not_reached();
-    }
-    clear_helper_retaddr();
-#endif
 }
 
 #if TCG_TARGET_REG_BITS == 64
-- 
2.34.1




* [PATCH v4 12/57] tcg: Add 128-bit guest memory primitives
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
                   ` (10 preceding siblings ...)
  2023-05-03  7:06 ` [PATCH v4 11/57] tcg/tci: Use helper_{ld,st}*_mmu " Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-05 10:04   ` Peter Maydell
  2023-05-03  7:06 ` [PATCH v4 13/57] meson: Detect atomic128 support with optimization Richard Henderson
                   ` (45 subsequent siblings)
  57 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel; +Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 accel/tcg/tcg-runtime.h        |   3 +
 include/tcg/tcg-ldst.h         |   4 +
 accel/tcg/cputlb.c             | 392 +++++++++++++++++++++++++--------
 accel/tcg/user-exec.c          |  94 ++++++--
 tcg/tcg-op.c                   | 184 +++++++++++-----
 accel/tcg/ldst_atomicity.c.inc | 189 ++++++++++++++++
 6 files changed, 688 insertions(+), 178 deletions(-)

diff --git a/accel/tcg/tcg-runtime.h b/accel/tcg/tcg-runtime.h
index b8e6421c8a..d9adc646c1 100644
--- a/accel/tcg/tcg-runtime.h
+++ b/accel/tcg/tcg-runtime.h
@@ -39,6 +39,9 @@ DEF_HELPER_FLAGS_1(exit_atomic, TCG_CALL_NO_WG, noreturn, env)
 DEF_HELPER_FLAGS_3(memset, TCG_CALL_NO_RWG, ptr, ptr, int, ptr)
 #endif /* IN_HELPER_PROTO */
 
+DEF_HELPER_FLAGS_3(ld_i128, TCG_CALL_NO_WG, i128, env, tl, i32)
+DEF_HELPER_FLAGS_4(st_i128, TCG_CALL_NO_WG, void, env, tl, i128, i32)
+
 DEF_HELPER_FLAGS_5(atomic_cmpxchgb, TCG_CALL_NO_WG,
                    i32, env, tl, i32, i32, i32)
 DEF_HELPER_FLAGS_5(atomic_cmpxchgw_be, TCG_CALL_NO_WG,
diff --git a/include/tcg/tcg-ldst.h b/include/tcg/tcg-ldst.h
index 57fafa14b1..64f48e6990 100644
--- a/include/tcg/tcg-ldst.h
+++ b/include/tcg/tcg-ldst.h
@@ -34,6 +34,8 @@ tcg_target_ulong helper_ldul_mmu(CPUArchState *env, target_ulong addr,
                                  MemOpIdx oi, uintptr_t retaddr);
 uint64_t helper_ldq_mmu(CPUArchState *env, target_ulong addr,
                         MemOpIdx oi, uintptr_t retaddr);
+Int128 helper_ld16_mmu(CPUArchState *env, target_ulong addr,
+                       MemOpIdx oi, uintptr_t retaddr);
 
 /* Value sign-extended to tcg register size.  */
 tcg_target_ulong helper_ldsb_mmu(CPUArchState *env, target_ulong addr,
@@ -55,6 +57,8 @@ void helper_stl_mmu(CPUArchState *env, target_ulong addr, uint32_t val,
                     MemOpIdx oi, uintptr_t retaddr);
 void helper_stq_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
                     MemOpIdx oi, uintptr_t retaddr);
+void helper_st16_mmu(CPUArchState *env, target_ulong addr, Int128 val,
+                     MemOpIdx oi, uintptr_t retaddr);
 
 #ifdef CONFIG_USER_ONLY
 
diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
index 566cf8311b..a77b439df8 100644
--- a/accel/tcg/cputlb.c
+++ b/accel/tcg/cputlb.c
@@ -40,6 +40,7 @@
 #include "qemu/plugin-memory.h"
 #endif
 #include "tcg/tcg-ldst.h"
+#include "exec/helper-proto.h"
 
 /* DEBUG defines, enable DEBUG_TLB_LOG to log to the CPU_LOG_MMU target */
 /* #define DEBUG_TLB */
@@ -2161,6 +2162,31 @@ static uint64_t do_ld_whole_be8(CPUArchState *env, uintptr_t ra,
     return (ret_be << (p->size * 8)) | x;
 }
 
+/**
+ * do_ld_whole_be16
+ * @p: translation parameters
+ * @ret_be: accumulated data
+ *
+ * As do_ld_bytes_beN, but with one atomic load.
+ * 16 aligned bytes are guaranteed to cover the load.
+ */
+static Int128 do_ld_whole_be16(CPUArchState *env, uintptr_t ra,
+                               MMULookupPageData *p, uint64_t ret_be)
+{
+    int o = p->addr & 15;
+    Int128 x, y = load_atomic16_or_exit(env, ra, p->haddr - o);
+    int size = p->size;
+
+    if (!HOST_BIG_ENDIAN) {
+        y = bswap128(y);
+    }
+    y = int128_lshift(y, o * 8);
+    y = int128_urshift(y, (16 - size) * 8);
+    x = int128_make64(ret_be);
+    x = int128_lshift(x, size * 8);
+    return int128_or(x, y);
+}
+
 /*
  * Wrapper for the above.
  */
@@ -2205,6 +2231,59 @@ static uint64_t do_ld_beN(CPUArchState *env, MMULookupPageData *p,
     }
 }
 
+/*
+ * Wrapper for the above, for 8 < size < 16.
+ */
+static Int128 do_ld16_beN(CPUArchState *env, MMULookupPageData *p,
+                          uint64_t a, int mmu_idx, MemOp mop, uintptr_t ra)
+{
+    int size = p->size;
+    uint64_t b;
+    MemOp atmax;
+
+    if (unlikely(p->flags & TLB_MMIO)) {
+        p->size = size - 8;
+        a = do_ld_mmio_beN(env, p, a, mmu_idx, MMU_DATA_LOAD, ra);
+        p->addr += p->size;
+        p->size = 8;
+        b = do_ld_mmio_beN(env, p, 0, mmu_idx, MMU_DATA_LOAD, ra);
+    } else {
+        switch (mop & MO_ATOM_MASK) {
+        case MO_ATOM_WITHIN16:
+            /*
+             * It is a given that we cross a page and therefore there is no
+             * atomicity for the load as a whole, but there may be a subobject
+             * as defined by ATMAX which does not cross a 16-byte boundary.
+             */
+            atmax = mop & MO_ATMAX_MASK;
+            if (atmax != MO_ATMAX_SIZE) {
+                atmax >>= MO_ATMAX_SHIFT;
+                if (unlikely(size >= (1 << atmax))) {
+                    return do_ld_whole_be16(env, ra, p, a);
+                }
+            }
+            /* fall through */
+        case MO_ATOM_IFALIGN:
+        case MO_ATOM_NONE:
+            p->size = size - 8;
+            a = do_ld_bytes_beN(p, a);
+            b = ldq_be_p(p->haddr + size - 8);
+            break;
+        case MO_ATOM_SUBALIGN:
+            p->size = size - 8;
+            a = do_ld_parts_beN(p, a);
+            p->haddr += size - 8;
+            p->size = 8;
+            b = do_ld_parts_beN(p, 0);
+            break;
+        default:
+            g_assert_not_reached();
+        }
+    }
+
+    return int128_make128(b, a);
+}
+
 static uint8_t do_ld_1(CPUArchState *env, MMULookupPageData *p, int mmu_idx,
                        MMUAccessType type, uintptr_t ra)
 {
@@ -2393,6 +2472,80 @@ tcg_target_ulong helper_ldsl_mmu(CPUArchState *env, target_ulong addr,
     return (int32_t)helper_ldul_mmu(env, addr, oi, retaddr);
 }
 
+static Int128 do_ld16_mmu(CPUArchState *env, target_ulong addr,
+                          MemOpIdx oi, uintptr_t ra)
+{
+    MMULookupLocals l;
+    bool crosspage;
+    uint64_t a, b;
+    Int128 ret;
+    int first;
+
+    crosspage = mmu_lookup(env, addr, oi, ra, MMU_DATA_LOAD, &l);
+    if (likely(!crosspage)) {
+        /* Perform the load host endian. */
+        if (unlikely(l.page[0].flags & TLB_MMIO)) {
+            QEMU_IOTHREAD_LOCK_GUARD();
+            a = io_readx(env, l.page[0].full, l.mmu_idx, addr,
+                         ra, MMU_DATA_LOAD, MO_64);
+            b = io_readx(env, l.page[0].full, l.mmu_idx, addr + 8,
+                         ra, MMU_DATA_LOAD, MO_64);
+            ret = int128_make128(HOST_BIG_ENDIAN ? b : a,
+                                 HOST_BIG_ENDIAN ? a : b);
+        } else {
+            ret = load_atom_16(env, ra, l.page[0].haddr, l.memop);
+        }
+        if (l.memop & MO_BSWAP) {
+            ret = bswap128(ret);
+        }
+        return ret;
+    }
+
+    first = l.page[0].size;
+    if (first == 8) {
+        MemOp mop8 = (l.memop & ~MO_SIZE) | MO_64;
+
+        a = do_ld_8(env, &l.page[0], l.mmu_idx, MMU_DATA_LOAD, mop8, ra);
+        b = do_ld_8(env, &l.page[1], l.mmu_idx, MMU_DATA_LOAD, mop8, ra);
+        if ((mop8 & MO_BSWAP) == MO_LE) {
+            ret = int128_make128(a, b);
+        } else {
+            ret = int128_make128(b, a);
+        }
+        return ret;
+    }
+
+    if (first < 8) {
+        a = do_ld_beN(env, &l.page[0], 0, l.mmu_idx,
+                      MMU_DATA_LOAD, l.memop, ra);
+        ret = do_ld16_beN(env, &l.page[1], a, l.mmu_idx, l.memop, ra);
+    } else {
+        ret = do_ld16_beN(env, &l.page[0], 0, l.mmu_idx, l.memop, ra);
+        b = int128_getlo(ret);
+        ret = int128_lshift(ret, l.page[1].size * 8);
+        a = int128_gethi(ret);
+        b = do_ld_beN(env, &l.page[1], b, l.mmu_idx,
+                      MMU_DATA_LOAD, l.memop, ra);
+        ret = int128_make128(b, a);
+    }
+    if ((l.memop & MO_BSWAP) == MO_LE) {
+        ret = bswap128(ret);
+    }
+    return ret;
+}
+
+Int128 helper_ld16_mmu(CPUArchState *env, target_ulong addr,
+                       uint32_t oi, uintptr_t retaddr)
+{
+    tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_128);
+    return do_ld16_mmu(env, addr, oi, retaddr);
+}
+
+Int128 helper_ld_i128(CPUArchState *env, target_ulong addr, uint32_t oi)
+{
+    return helper_ld16_mmu(env, addr, oi, GETPC());
+}
+
 /*
  * Load helpers for cpu_ldst.h.
  */
@@ -2481,59 +2634,23 @@ uint64_t cpu_ldq_le_mmu(CPUArchState *env, abi_ptr addr,
 Int128 cpu_ld16_be_mmu(CPUArchState *env, abi_ptr addr,
                        MemOpIdx oi, uintptr_t ra)
 {
-    MemOp mop = get_memop(oi);
-    int mmu_idx = get_mmuidx(oi);
-    MemOpIdx new_oi;
-    unsigned a_bits;
-    uint64_t h, l;
+    Int128 ret;
 
-    tcg_debug_assert((mop & (MO_BSWAP|MO_SSIZE)) == (MO_BE|MO_128));
-    a_bits = get_alignment_bits(mop);
-
-    /* Handle CPU specific unaligned behaviour */
-    if (addr & ((1 << a_bits) - 1)) {
-        cpu_unaligned_access(env_cpu(env), addr, MMU_DATA_LOAD,
-                             mmu_idx, ra);
-    }
-
-    /* Construct an unaligned 64-bit replacement MemOpIdx. */
-    mop = (mop & ~(MO_SIZE | MO_AMASK)) | MO_64 | MO_UNALN;
-    new_oi = make_memop_idx(mop, mmu_idx);
-
-    h = helper_ldq_mmu(env, addr, new_oi, ra);
-    l = helper_ldq_mmu(env, addr + 8, new_oi, ra);
-
-    qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_R);
-    return int128_make128(l, h);
+    tcg_debug_assert((get_memop(oi) & (MO_BSWAP|MO_SIZE)) == (MO_BE|MO_128));
+    ret = do_ld16_mmu(env, addr, oi, ra);
+    plugin_load_cb(env, addr, oi);
+    return ret;
 }
 
 Int128 cpu_ld16_le_mmu(CPUArchState *env, abi_ptr addr,
                        MemOpIdx oi, uintptr_t ra)
 {
-    MemOp mop = get_memop(oi);
-    int mmu_idx = get_mmuidx(oi);
-    MemOpIdx new_oi;
-    unsigned a_bits;
-    uint64_t h, l;
+    Int128 ret;
 
-    tcg_debug_assert((mop & (MO_BSWAP|MO_SSIZE)) == (MO_LE|MO_128));
-    a_bits = get_alignment_bits(mop);
-
-    /* Handle CPU specific unaligned behaviour */
-    if (addr & ((1 << a_bits) - 1)) {
-        cpu_unaligned_access(env_cpu(env), addr, MMU_DATA_LOAD,
-                             mmu_idx, ra);
-    }
-
-    /* Construct an unaligned 64-bit replacement MemOpIdx. */
-    mop = (mop & ~(MO_SIZE | MO_AMASK)) | MO_64 | MO_UNALN;
-    new_oi = make_memop_idx(mop, mmu_idx);
-
-    l = helper_ldq_mmu(env, addr, new_oi, ra);
-    h = helper_ldq_mmu(env, addr + 8, new_oi, ra);
-
-    qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_R);
-    return int128_make128(l, h);
+    tcg_debug_assert((get_memop(oi) & (MO_BSWAP|MO_SIZE)) == (MO_LE|MO_128));
+    ret = do_ld16_mmu(env, addr, oi, ra);
+    plugin_load_cb(env, addr, oi);
+    return ret;
 }
 
 /*
@@ -2614,6 +2731,57 @@ static uint64_t do_st_leN(CPUArchState *env, MMULookupPageData *p,
     }
 }
 
+/*
+ * Wrapper for the above, for 8 < size < 16.
+ */
+static uint64_t do_st16_leN(CPUArchState *env, MMULookupPageData *p,
+                            Int128 val_le, int mmu_idx,
+                            MemOp mop, uintptr_t ra)
+{
+    int size = p->size;
+    MemOp atmax;
+
+    if (unlikely(p->flags & TLB_MMIO)) {
+        p->size = 8;
+        do_st_mmio_leN(env, p, int128_getlo(val_le), mmu_idx, ra);
+        p->size = size - 8;
+        p->addr += 8;
+        return do_st_mmio_leN(env, p, int128_gethi(val_le), mmu_idx, ra);
+    } else if (unlikely(p->flags & TLB_DISCARD_WRITE)) {
+        return int128_gethi(val_le) >> ((size - 8) * 8);
+    }
+
+    switch (mop & MO_ATOM_MASK) {
+    case MO_ATOM_WITHIN16:
+        /*
+         * It is a given that we cross a page and therefore there is no
+         * atomicity for the store as a whole, but there may be a subobject
+         * as defined by ATMAX which does not cross a 16-byte boundary.
+         */
+        atmax = mop & MO_ATMAX_MASK;
+        if (atmax != MO_ATMAX_SIZE) {
+            atmax >>= MO_ATMAX_SHIFT;
+            if (unlikely(size >= (1 << atmax))) {
+                if (HAVE_al16) {
+                    return store_whole_le16(p->haddr, p->size, val_le);
+                } else {
+                    cpu_loop_exit_atomic(env_cpu(env), ra);
+                }
+            }
+        }
+        /* fall through */
+    case MO_ATOM_IFALIGN:
+    case MO_ATOM_NONE:
+        stq_le_p(p->haddr, int128_getlo(val_le));
+        return store_bytes_leN(p->haddr + 8, p->size - 8, int128_gethi(val_le));
+    case MO_ATOM_SUBALIGN:
+        store_parts_leN(p->haddr, 8, int128_getlo(val_le));
+        return store_parts_leN(p->haddr + 8, p->size - 8, int128_gethi(val_le));
+    default:
+        g_assert_not_reached();
+    }
+}
+
 static void do_st_1(CPUArchState *env, MMULookupPageData *p, uint8_t val,
                     int mmu_idx, uintptr_t ra)
 {
@@ -2770,6 +2938,80 @@ void helper_stq_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
     do_st8_mmu(env, addr, val, oi, retaddr);
 }
 
+static void do_st16_mmu(CPUArchState *env, target_ulong addr, Int128 val,
+                        MemOpIdx oi, uintptr_t ra)
+{
+    MMULookupLocals l;
+    bool crosspage;
+    uint64_t a, b;
+    int first;
+
+    crosspage = mmu_lookup(env, addr, oi, ra, MMU_DATA_STORE, &l);
+    if (likely(!crosspage)) {
+        /* Swap to host endian if necessary, then store. */
+        if (l.memop & MO_BSWAP) {
+            val = bswap128(val);
+        }
+        if (unlikely(l.page[0].flags & TLB_MMIO)) {
+            QEMU_IOTHREAD_LOCK_GUARD();
+            if (HOST_BIG_ENDIAN) {
+                b = int128_getlo(val), a = int128_gethi(val);
+            } else {
+                a = int128_getlo(val), b = int128_gethi(val);
+            }
+            io_writex(env, l.page[0].full, l.mmu_idx, a, addr, ra, MO_64);
+            io_writex(env, l.page[0].full, l.mmu_idx, b, addr + 8, ra, MO_64);
+        } else if (unlikely(l.page[0].flags & TLB_DISCARD_WRITE)) {
+            /* nothing */
+        } else {
+            store_atom_16(env, ra, l.page[0].haddr, l.memop, val);
+        }
+        return;
+    }
+
+    first = l.page[0].size;
+    if (first == 8) {
+        MemOp mop8 = (l.memop & ~(MO_SIZE | MO_BSWAP)) | MO_64;
+
+        if (l.memop & MO_BSWAP) {
+            val = bswap128(val);
+        }
+        if (HOST_BIG_ENDIAN) {
+            b = int128_getlo(val), a = int128_gethi(val);
+        } else {
+            a = int128_getlo(val), b = int128_gethi(val);
+        }
+        do_st_8(env, &l.page[0], a, l.mmu_idx, mop8, ra);
+        do_st_8(env, &l.page[1], b, l.mmu_idx, mop8, ra);
+        return;
+    }
+
+    if ((l.memop & MO_BSWAP) != MO_LE) {
+        val = bswap128(val);
+    }
+    if (first < 8) {
+        do_st_leN(env, &l.page[0], int128_getlo(val), l.mmu_idx, l.memop, ra);
+        val = int128_urshift(val, first * 8);
+        do_st16_leN(env, &l.page[1], val, l.mmu_idx, l.memop, ra);
+    } else {
+        b = do_st16_leN(env, &l.page[0], val, l.mmu_idx, l.memop, ra);
+        do_st_leN(env, &l.page[1], b, l.mmu_idx, l.memop, ra);
+    }
+}
+
+void helper_st16_mmu(CPUArchState *env, target_ulong addr, Int128 val,
+                     MemOpIdx oi, uintptr_t retaddr)
+{
+    tcg_debug_assert((get_memop(oi) & MO_SIZE) == MO_128);
+    do_st16_mmu(env, addr, val, oi, retaddr);
+}
+
+void helper_st_i128(CPUArchState *env, target_ulong addr, Int128 val,
+                    MemOpIdx oi)
+{
+    helper_st16_mmu(env, addr, val, oi, GETPC());
+}
+
 /*
  * Store Helpers for cpu_ldst.h
  */
@@ -2834,58 +3076,20 @@ void cpu_stq_le_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
     plugin_store_cb(env, addr, oi);
 }
 
-void cpu_st16_be_mmu(CPUArchState *env, abi_ptr addr, Int128 val,
-                     MemOpIdx oi, uintptr_t ra)
+void cpu_st16_be_mmu(CPUArchState *env, target_ulong addr, Int128 val,
+                     MemOpIdx oi, uintptr_t retaddr)
 {
-    MemOp mop = get_memop(oi);
-    int mmu_idx = get_mmuidx(oi);
-    MemOpIdx new_oi;
-    unsigned a_bits;
-
-    tcg_debug_assert((mop & (MO_BSWAP|MO_SSIZE)) == (MO_BE|MO_128));
-    a_bits = get_alignment_bits(mop);
-
-    /* Handle CPU specific unaligned behaviour */
-    if (addr & ((1 << a_bits) - 1)) {
-        cpu_unaligned_access(env_cpu(env), addr, MMU_DATA_STORE,
-                             mmu_idx, ra);
-    }
-
-    /* Construct an unaligned 64-bit replacement MemOpIdx. */
-    mop = (mop & ~(MO_SIZE | MO_AMASK)) | MO_64 | MO_UNALN;
-    new_oi = make_memop_idx(mop, mmu_idx);
-
-    helper_stq_mmu(env, addr, int128_gethi(val), new_oi, ra);
-    helper_stq_mmu(env, addr + 8, int128_getlo(val), new_oi, ra);
-
-    qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_W);
+    tcg_debug_assert((get_memop(oi) & (MO_BSWAP|MO_SIZE)) == (MO_BE|MO_128));
+    do_st16_mmu(env, addr, val, oi, retaddr);
+    plugin_store_cb(env, addr, oi);
 }
 
-void cpu_st16_le_mmu(CPUArchState *env, abi_ptr addr, Int128 val,
-                     MemOpIdx oi, uintptr_t ra)
+void cpu_st16_le_mmu(CPUArchState *env, target_ulong addr, Int128 val,
+                     MemOpIdx oi, uintptr_t retaddr)
 {
-    MemOp mop = get_memop(oi);
-    int mmu_idx = get_mmuidx(oi);
-    MemOpIdx new_oi;
-    unsigned a_bits;
-
-    tcg_debug_assert((mop & (MO_BSWAP|MO_SSIZE)) == (MO_LE|MO_128));
-    a_bits = get_alignment_bits(mop);
-
-    /* Handle CPU specific unaligned behaviour */
-    if (addr & ((1 << a_bits) - 1)) {
-        cpu_unaligned_access(env_cpu(env), addr, MMU_DATA_STORE,
-                             mmu_idx, ra);
-    }
-
-    /* Construct an unaligned 64-bit replacement MemOpIdx. */
-    mop = (mop & ~(MO_SIZE | MO_AMASK)) | MO_64 | MO_UNALN;
-    new_oi = make_memop_idx(mop, mmu_idx);
-
-    helper_stq_mmu(env, addr, int128_getlo(val), new_oi, ra);
-    helper_stq_mmu(env, addr + 8, int128_gethi(val), new_oi, ra);
-
-    qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_W);
+    tcg_debug_assert((get_memop(oi) & (MO_BSWAP|MO_SIZE)) == (MO_LE|MO_128));
+    do_st16_mmu(env, addr, val, oi, retaddr);
+    plugin_store_cb(env, addr, oi);
 }
 
 #include "ldst_common.c.inc"
diff --git a/accel/tcg/user-exec.c b/accel/tcg/user-exec.c
index d9f9766b7f..8f86254eb4 100644
--- a/accel/tcg/user-exec.c
+++ b/accel/tcg/user-exec.c
@@ -1121,18 +1121,45 @@ uint64_t cpu_ldq_le_mmu(CPUArchState *env, abi_ptr addr,
     return cpu_to_le64(ret);
 }
 
-Int128 cpu_ld16_be_mmu(CPUArchState *env, abi_ptr addr,
-                       MemOpIdx oi, uintptr_t ra)
+static Int128 do_ld16_he_mmu(CPUArchState *env, abi_ptr addr,
+                             MemOp mop, uintptr_t ra)
 {
     void *haddr;
     Int128 ret;
 
-    tcg_debug_assert((get_memop(oi) & (MO_BSWAP | MO_SIZE)) == (MO_128 | MO_BE));
-    haddr = cpu_mmu_lookup(env, addr, oi, ra, MMU_DATA_LOAD);
-    memcpy(&ret, haddr, 16);
+    tcg_debug_assert((mop & MO_SIZE) == MO_128);
+    haddr = cpu_mmu_lookup(env, addr, mop, ra, MMU_DATA_LOAD);
+    ret = load_atom_16(env, ra, haddr, mop);
     clear_helper_retaddr();
-    qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_R);
+    return ret;
+}
 
+Int128 helper_ld16_mmu(CPUArchState *env, target_ulong addr,
+                       MemOpIdx oi, uintptr_t ra)
+{
+    MemOp mop = get_memop(oi);
+    Int128 ret = do_ld16_he_mmu(env, addr, mop, ra);
+
+    if (mop & MO_BSWAP) {
+        ret = bswap128(ret);
+    }
+    return ret;
+}
+
+Int128 helper_ld_i128(CPUArchState *env, target_ulong addr, MemOpIdx oi)
+{
+    return helper_ld16_mmu(env, addr, oi, GETPC());
+}
+
+Int128 cpu_ld16_be_mmu(CPUArchState *env, abi_ptr addr,
+                       MemOpIdx oi, uintptr_t ra)
+{
+    MemOp mop = get_memop(oi);
+    Int128 ret;
+
+    tcg_debug_assert((mop & MO_BSWAP) == MO_BE);
+    ret = do_ld16_he_mmu(env, addr, mop, ra);
+    qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_R);
     if (!HOST_BIG_ENDIAN) {
         ret = bswap128(ret);
     }
@@ -1142,15 +1169,12 @@ Int128 cpu_ld16_be_mmu(CPUArchState *env, abi_ptr addr,
 Int128 cpu_ld16_le_mmu(CPUArchState *env, abi_ptr addr,
                        MemOpIdx oi, uintptr_t ra)
 {
-    void *haddr;
+    MemOp mop = get_memop(oi);
     Int128 ret;
 
-    tcg_debug_assert((get_memop(oi) & (MO_BSWAP | MO_SIZE)) == (MO_128 | MO_LE));
-    haddr = cpu_mmu_lookup(env, addr, oi, ra, MMU_DATA_LOAD);
-    memcpy(&ret, haddr, 16);
-    clear_helper_retaddr();
+    tcg_debug_assert((mop & MO_BSWAP) == MO_LE);
+    ret = do_ld16_he_mmu(env, addr, mop, ra);
     qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_R);
-
     if (HOST_BIG_ENDIAN) {
         ret = bswap128(ret);
     }
@@ -1307,33 +1331,57 @@ void cpu_stq_le_mmu(CPUArchState *env, abi_ptr addr, uint64_t val,
     qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_W);
 }
 
-void cpu_st16_be_mmu(CPUArchState *env, abi_ptr addr,
-                     Int128 val, MemOpIdx oi, uintptr_t ra)
+static void do_st16_he_mmu(CPUArchState *env, abi_ptr addr, Int128 val,
+                           MemOp mop, uintptr_t ra)
 {
     void *haddr;
 
-    tcg_debug_assert((get_memop(oi) & (MO_BSWAP | MO_SIZE)) == (MO_128 | MO_BE));
-    haddr = cpu_mmu_lookup(env, addr, oi, ra, MMU_DATA_STORE);
+    tcg_debug_assert((mop & MO_SIZE) == MO_128);
+    haddr = cpu_mmu_lookup(env, addr, mop, ra, MMU_DATA_STORE);
+    store_atom_16(env, ra, haddr, mop, val);
+    clear_helper_retaddr();
+}
+
+void helper_st16_mmu(CPUArchState *env, target_ulong addr, Int128 val,
+                     MemOpIdx oi, uintptr_t ra)
+{
+    MemOp mop = get_memop(oi);
+
+    if (mop & MO_BSWAP) {
+        val = bswap128(val);
+    }
+    do_st16_he_mmu(env, addr, val, mop, ra);
+}
+
+void helper_st_i128(CPUArchState *env, target_ulong addr,
+                    Int128 val, MemOpIdx oi)
+{
+    helper_st16_mmu(env, addr, val, oi, GETPC());
+}
+
+void cpu_st16_be_mmu(CPUArchState *env, abi_ptr addr,
+                     Int128 val, MemOpIdx oi, uintptr_t ra)
+{
+    MemOp mop = get_memop(oi);
+
+    tcg_debug_assert((mop & MO_BSWAP) == MO_BE);
     if (!HOST_BIG_ENDIAN) {
         val = bswap128(val);
     }
-    memcpy(haddr, &val, 16);
-    clear_helper_retaddr();
+    do_st16_he_mmu(env, addr, val, mop, ra);
     qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_W);
 }
 
 void cpu_st16_le_mmu(CPUArchState *env, abi_ptr addr,
                      Int128 val, MemOpIdx oi, uintptr_t ra)
 {
-    void *haddr;
+    MemOp mop = get_memop(oi);
 
-    tcg_debug_assert((get_memop(oi) & (MO_BSWAP | MO_SIZE)) == (MO_128 | MO_LE));
-    haddr = cpu_mmu_lookup(env, addr, oi, ra, MMU_DATA_STORE);
+    tcg_debug_assert((mop & MO_BSWAP) == MO_LE);
     if (HOST_BIG_ENDIAN) {
         val = bswap128(val);
     }
-    memcpy(haddr, &val, 16);
-    clear_helper_retaddr();
+    do_st16_he_mmu(env, addr, val, mop, ra);
     qemu_plugin_vcpu_mem_cb(env_cpu(env), addr, oi, QEMU_PLUGIN_MEM_W);
 }
 
diff --git a/tcg/tcg-op.c b/tcg/tcg-op.c
index 3136cef81a..9101d334b6 100644
--- a/tcg/tcg-op.c
+++ b/tcg/tcg-op.c
@@ -3119,6 +3119,48 @@ void tcg_gen_qemu_st_i64(TCGv_i64 val, TCGv addr, TCGArg idx, MemOp memop)
     }
 }
 
+/*
+ * Return true if @mop, without knowledge of the pointer alignment,
+ * does not require 16-byte atomicity, and it would be advantageous
+ * to avoid a call to a helper function.
+ */
+static bool use_two_i64_for_i128(MemOp mop)
+{
+#ifdef CONFIG_SOFTMMU
+    /* Two softmmu tlb lookups is larger than one function call. */
+    return false;
+#else
+    /*
+     * For user-only, two 64-bit operations may well be smaller than a call.
+     * Determine if that would be legal for the requested atomicity.
+     */
+    MemOp atom = mop & MO_ATOM_MASK;
+    MemOp atmax = mop & MO_ATMAX_MASK;
+
+    /* In a serialized context, no atomicity is required. */
+    if (!(tcg_ctx->gen_tb->cflags & CF_PARALLEL)) {
+        return true;
+    }
+
+    if (atmax == MO_ATMAX_SIZE) {
+        atmax = mop & MO_SIZE;
+    } else {
+        atmax >>= MO_ATMAX_SHIFT;
+    }
+    switch (atom) {
+    case MO_ATOM_NONE:
+        return true;
+    case MO_ATOM_IFALIGN:
+    case MO_ATOM_SUBALIGN:
+        return atmax < MO_128;
+    case MO_ATOM_WITHIN16:
+        return atmax == MO_8;
+    default:
+        g_assert_not_reached();
+    }
+#endif
+}
+
 static void canonicalize_memop_i128_as_i64(MemOp ret[2], MemOp orig)
 {
     MemOp mop_1 = orig, mop_2;
@@ -3164,93 +3206,113 @@ static void canonicalize_memop_i128_as_i64(MemOp ret[2], MemOp orig)
     ret[1] = mop_2;
 }
 
+#if TARGET_LONG_BITS == 64
+#define tcg_temp_ebb_new  tcg_temp_ebb_new_i64
+#else
+#define tcg_temp_ebb_new  tcg_temp_ebb_new_i32
+#endif
+
 void tcg_gen_qemu_ld_i128(TCGv_i128 val, TCGv addr, TCGArg idx, MemOp memop)
 {
-    MemOp mop[2];
-    TCGv addr_p8;
-    TCGv_i64 x, y;
+    MemOpIdx oi = make_memop_idx(memop, idx);
 
-    canonicalize_memop_i128_as_i64(mop, memop);
+    tcg_debug_assert((memop & MO_SIZE) == MO_128);
+    tcg_debug_assert((memop & MO_SIGN) == 0);
 
     tcg_gen_req_mo(TCG_MO_LD_LD | TCG_MO_ST_LD);
     addr = plugin_prep_mem_callbacks(addr);
 
-    /* TODO: respect atomicity of the operation. */
     /* TODO: allow the tcg backend to see the whole operation. */
 
-    /*
-     * Since there are no global TCGv_i128, there is no visible state
-     * changed if the second load faults.  Load directly into the two
-     * subwords.
-     */
-    if ((memop & MO_BSWAP) == MO_LE) {
-        x = TCGV128_LOW(val);
-        y = TCGV128_HIGH(val);
+    if (use_two_i64_for_i128(memop)) {
+        MemOp mop[2];
+        TCGv addr_p8;
+        TCGv_i64 x, y;
+
+        canonicalize_memop_i128_as_i64(mop, memop);
+
+        /*
+         * Since there are no global TCGv_i128, there is no visible state
+         * changed if the second load faults.  Load directly into the two
+         * subwords.
+         */
+        if ((memop & MO_BSWAP) == MO_LE) {
+            x = TCGV128_LOW(val);
+            y = TCGV128_HIGH(val);
+        } else {
+            x = TCGV128_HIGH(val);
+            y = TCGV128_LOW(val);
+        }
+
+        gen_ldst_i64(INDEX_op_qemu_ld_i64, x, addr, mop[0], idx);
+
+        if ((mop[0] ^ memop) & MO_BSWAP) {
+            tcg_gen_bswap64_i64(x, x);
+        }
+
+        addr_p8 = tcg_temp_ebb_new();
+        tcg_gen_addi_tl(addr_p8, addr, 8);
+        gen_ldst_i64(INDEX_op_qemu_ld_i64, y, addr_p8, mop[1], idx);
+        tcg_temp_free(addr_p8);
+
+        if ((mop[0] ^ memop) & MO_BSWAP) {
+            tcg_gen_bswap64_i64(y, y);
+        }
     } else {
-        x = TCGV128_HIGH(val);
-        y = TCGV128_LOW(val);
+        gen_helper_ld_i128(val, cpu_env, addr, tcg_constant_i32(oi));
     }
 
-    gen_ldst_i64(INDEX_op_qemu_ld_i64, x, addr, mop[0], idx);
-
-    if ((mop[0] ^ memop) & MO_BSWAP) {
-        tcg_gen_bswap64_i64(x, x);
-    }
-
-    addr_p8 = tcg_temp_new();
-    tcg_gen_addi_tl(addr_p8, addr, 8);
-    gen_ldst_i64(INDEX_op_qemu_ld_i64, y, addr_p8, mop[1], idx);
-    tcg_temp_free(addr_p8);
-
-    if ((mop[0] ^ memop) & MO_BSWAP) {
-        tcg_gen_bswap64_i64(y, y);
-    }
-
-    plugin_gen_mem_callbacks(addr, make_memop_idx(memop, idx),
-                             QEMU_PLUGIN_MEM_R);
+    plugin_gen_mem_callbacks(addr, oi, QEMU_PLUGIN_MEM_R);
 }
 
 void tcg_gen_qemu_st_i128(TCGv_i128 val, TCGv addr, TCGArg idx, MemOp memop)
 {
-    MemOp mop[2];
-    TCGv addr_p8;
-    TCGv_i64 x, y;
+    MemOpIdx oi = make_memop_idx(memop, idx);
 
-    canonicalize_memop_i128_as_i64(mop, memop);
+    tcg_debug_assert((memop & MO_SIZE) == MO_128);
+    tcg_debug_assert((memop & MO_SIGN) == 0);
 
     tcg_gen_req_mo(TCG_MO_ST_LD | TCG_MO_ST_ST);
     addr = plugin_prep_mem_callbacks(addr);
 
-    /* TODO: respect atomicity of the operation. */
     /* TODO: allow the tcg backend to see the whole operation. */
 
-    if ((memop & MO_BSWAP) == MO_LE) {
-        x = TCGV128_LOW(val);
-        y = TCGV128_HIGH(val);
+    if (use_two_i64_for_i128(memop)) {
+        MemOp mop[2];
+        TCGv addr_p8;
+        TCGv_i64 x, y;
+
+        canonicalize_memop_i128_as_i64(mop, memop);
+
+        if ((memop & MO_BSWAP) == MO_LE) {
+            x = TCGV128_LOW(val);
+            y = TCGV128_HIGH(val);
+        } else {
+            x = TCGV128_HIGH(val);
+            y = TCGV128_LOW(val);
+        }
+
+        addr_p8 = tcg_temp_ebb_new();
+        if ((mop[0] ^ memop) & MO_BSWAP) {
+            TCGv_i64 t = tcg_temp_ebb_new_i64();
+
+            tcg_gen_bswap64_i64(t, x);
+            gen_ldst_i64(INDEX_op_qemu_st_i64, t, addr, mop[0], idx);
+            tcg_gen_bswap64_i64(t, y);
+            tcg_gen_addi_tl(addr_p8, addr, 8);
+            gen_ldst_i64(INDEX_op_qemu_st_i64, t, addr_p8, mop[1], idx);
+            tcg_temp_free_i64(t);
+        } else {
+            gen_ldst_i64(INDEX_op_qemu_st_i64, x, addr, mop[0], idx);
+            tcg_gen_addi_tl(addr_p8, addr, 8);
+            gen_ldst_i64(INDEX_op_qemu_st_i64, y, addr_p8, mop[1], idx);
+        }
+        tcg_temp_free(addr_p8);
     } else {
-        x = TCGV128_HIGH(val);
-        y = TCGV128_LOW(val);
+        gen_helper_st_i128(cpu_env, addr, val, tcg_constant_i32(oi));
     }
 
-    addr_p8 = tcg_temp_new();
-    if ((mop[0] ^ memop) & MO_BSWAP) {
-        TCGv_i64 t = tcg_temp_ebb_new_i64();
-
-        tcg_gen_bswap64_i64(t, x);
-        gen_ldst_i64(INDEX_op_qemu_st_i64, t, addr, mop[0], idx);
-        tcg_gen_bswap64_i64(t, y);
-        tcg_gen_addi_tl(addr_p8, addr, 8);
-        gen_ldst_i64(INDEX_op_qemu_st_i64, t, addr_p8, mop[1], idx);
-        tcg_temp_free_i64(t);
-    } else {
-        gen_ldst_i64(INDEX_op_qemu_st_i64, x, addr, mop[0], idx);
-        tcg_gen_addi_tl(addr_p8, addr, 8);
-        gen_ldst_i64(INDEX_op_qemu_st_i64, y, addr_p8, mop[1], idx);
-    }
-    tcg_temp_free(addr_p8);
-
-    plugin_gen_mem_callbacks(addr, make_memop_idx(memop, idx),
-                             QEMU_PLUGIN_MEM_W);
+    plugin_gen_mem_callbacks(addr, oi, QEMU_PLUGIN_MEM_W);
 }
 
 static void tcg_gen_ext_i32(TCGv_i32 ret, TCGv_i32 val, MemOp opc)
diff --git a/accel/tcg/ldst_atomicity.c.inc b/accel/tcg/ldst_atomicity.c.inc
index 07abbdee3f..e61121d6bf 100644
--- a/accel/tcg/ldst_atomicity.c.inc
+++ b/accel/tcg/ldst_atomicity.c.inc
@@ -423,6 +423,21 @@ static inline uint64_t load_atom_8_by_4(void *pv)
     }
 }
 
+/**
+ * load_atom_8_by_8_or_4:
+ * @pv: host address
+ *
+ * Load 8 bytes from aligned @pv, with at least 4-byte atomicity.
+ */
+static inline uint64_t load_atom_8_by_8_or_4(void *pv)
+{
+    if (HAVE_al8_fast) {
+        return load_atomic8(pv);
+    } else {
+        return load_atom_8_by_4(pv);
+    }
+}
+
 /**
  * load_atom_2:
  * @p: host address
@@ -555,6 +570,64 @@ static uint64_t load_atom_8(CPUArchState *env, uintptr_t ra,
     }
 }
 
+/**
+ * load_atom_16:
+ * @p: host address
+ * @memop: the full memory op
+ *
+ * Load 16 bytes from @p, honoring the atomicity of @memop.
+ */
+static Int128 load_atom_16(CPUArchState *env, uintptr_t ra,
+                           void *pv, MemOp memop)
+{
+    uintptr_t pi = (uintptr_t)pv;
+    int atmax;
+    Int128 r;
+    uint64_t a, b;
+
+    /*
+     * If the host does not support 16-byte atomics, wait until we have
+     * examined the atomicity parameters below.
+     */
+    if (HAVE_al16_fast && likely((pi & 15) == 0)) {
+        return load_atomic16(pv);
+    }
+
+    atmax = required_atomicity(env, pi, memop);
+    switch (atmax) {
+    case MO_8:
+        memcpy(&r, pv, 16);
+        return r;
+    case MO_16:
+        a = load_atom_8_by_2(pv);
+        b = load_atom_8_by_2(pv + 8);
+        break;
+    case MO_32:
+        a = load_atom_8_by_4(pv);
+        b = load_atom_8_by_4(pv + 8);
+        break;
+    case MO_64:
+        if (!HAVE_al8) {
+            cpu_loop_exit_atomic(env_cpu(env), ra);
+        }
+        a = load_atomic8(pv);
+        b = load_atomic8(pv + 8);
+        break;
+    case -MO_64:
+        if (!HAVE_al8) {
+            cpu_loop_exit_atomic(env_cpu(env), ra);
+        }
+        a = load_atom_extract_al8x2(pv);
+        b = load_atom_extract_al8x2(pv + 8);
+        break;
+    case MO_128:
+        return load_atomic16_or_exit(env, ra, pv);
+    default:
+        g_assert_not_reached();
+    }
+    return int128_make128(HOST_BIG_ENDIAN ? b : a, HOST_BIG_ENDIAN ? a : b);
+}
+
 /**
  * store_atomic2:
  * @pv: host address
@@ -596,6 +669,40 @@ static inline void store_atomic8(void *pv, uint64_t val)
     qatomic_set__nocheck(p, val);
 }
 
+/**
+ * store_atomic16:
+ * @pv: host address
+ * @val: value to store
+ *
+ * Atomically store 16 aligned bytes to @pv.
+ */
+static inline void store_atomic16(void *pv, Int128 val)
+{
+#if defined(CONFIG_ATOMIC128)
+    __uint128_t *pu = __builtin_assume_aligned(pv, 16);
+    Int128Alias new;
+
+    new.s = val;
+    qatomic_set__nocheck(pu, new.u);
+#elif defined(CONFIG_CMPXCHG128)
+    __uint128_t *pu = __builtin_assume_aligned(pv, 16);
+    __uint128_t o;
+    Int128Alias n;
+
+    /*
+     * Without CONFIG_ATOMIC128, __atomic_compare_exchange_n will always
+     * defer to libatomic, so we must use __sync_val_compare_and_swap_16
+     * and accept the sequential consistency that comes with it.
+     */
+    n.s = val;
+    do {
+        o = *pu;
+    } while (!__sync_bool_compare_and_swap_16(pu, o, n.u));
+#else
+    qemu_build_not_reached();
+#endif
+}
+
 /**
  * store_atom_4x2
  */
@@ -1039,3 +1146,85 @@ static void store_atom_8(CPUArchState *env, uintptr_t ra,
     }
     cpu_loop_exit_atomic(env_cpu(env), ra);
 }
+
+/**
+ * store_atom_16:
+ * @p: host address
+ * @val: the value to store
+ * @memop: the full memory op
+ *
+ * Store 16 bytes to @p, honoring the atomicity of @memop.
+ */
+static void store_atom_16(CPUArchState *env, uintptr_t ra,
+                          void *pv, MemOp memop, Int128 val)
+{
+    uintptr_t pi = (uintptr_t)pv;
+    uint64_t a, b;
+    int atmax;
+
+    if (HAVE_al16_fast && likely((pi & 15) == 0)) {
+        store_atomic16(pv, val);
+        return;
+    }
+
+    atmax = required_atomicity(env, pi, memop);
+
+    a = HOST_BIG_ENDIAN ? int128_gethi(val) : int128_getlo(val);
+    b = HOST_BIG_ENDIAN ? int128_getlo(val) : int128_gethi(val);
+    switch (atmax) {
+    case MO_8:
+        memcpy(pv, &val, 16);
+        return;
+    case MO_16:
+        store_atom_8_by_2(pv, a);
+        store_atom_8_by_2(pv + 8, b);
+        return;
+    case MO_32:
+        store_atom_8_by_4(pv, a);
+        store_atom_8_by_4(pv + 8, b);
+        return;
+    case MO_64:
+        if (HAVE_al8) {
+            store_atomic8(pv, a);
+            store_atomic8(pv + 8, b);
+            return;
+        }
+        break;
+    case -MO_64:
+        if (HAVE_al16) {
+            uint64_t val_le;
+            int s2 = pi & 15;
+            int s1 = 16 - s2;
+
+            if (HOST_BIG_ENDIAN) {
+                val = bswap128(val);
+            }
+            switch (s2) {
+            case 1 ... 7:
+                val_le = store_whole_le16(pv, s1, val);
+                store_bytes_leN(pv + s1, s2, val_le);
+                break;
+            case 9 ... 15:
+                store_bytes_leN(pv, s1, int128_getlo(val));
+                val = int128_urshift(val, s1 * 8);
+                store_whole_le16(pv + s1, s2, val);
+                break;
+            case 0: /* aligned */
+            case 8: /* atmax MO_64 */
+            default:
+                g_assert_not_reached();
+            }
+            return;
+        }
+        break;
+    case MO_128:
+        if (HAVE_al16) {
+            store_atomic16(pv, val);
+            return;
+        }
+        break;
+    default:
+        g_assert_not_reached();
+    }
+    cpu_loop_exit_atomic(env_cpu(env), ra);
+}
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH v4 13/57] meson: Detect atomic128 support with optimization
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
                   ` (11 preceding siblings ...)
  2023-05-03  7:06 ` [PATCH v4 12/57] tcg: Add 128-bit guest memory primitives Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-05 10:29   ` Peter Maydell
  2023-05-03  7:06 ` [PATCH v4 14/57] tcg/i386: Add have_atomic16 Richard Henderson
                   ` (44 subsequent siblings)
  57 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel; +Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

There is an edge condition prior to gcc13 for which optimization
is required to generate 16-byte atomic sequences.  Detect this.
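
As a stand-alone illustration of what the probe below checks (a link test
only, GCC-specific; the optimize("O1") attribute is the pre-GCC-13
workaround, and the constant alignment argument mirrors the hunk):

    __attribute__((optimize("O1")))
    int main(int ac, char **av)
    {
        /* Force the 16-byte alignment that the runtime code guarantees. */
        unsigned __int128 *p = __builtin_assume_aligned(av[ac - 1], 16);

        p[1] = __atomic_load_n(&p[0], __ATOMIC_RELAXED);
        __atomic_store_n(&p[2], p[3], __ATOMIC_RELAXED);
        return 0;
    }

If this links without libatomic being pulled in, the 16-byte __atomic
builtins were inlined and CONFIG_ATOMIC128_OPT can be set.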

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 accel/tcg/ldst_atomicity.c.inc | 38 ++++++++++++++++++-------
 meson.build                    | 52 ++++++++++++++++++++++------------
 2 files changed, 61 insertions(+), 29 deletions(-)

diff --git a/accel/tcg/ldst_atomicity.c.inc b/accel/tcg/ldst_atomicity.c.inc
index e61121d6bf..c43f101ebe 100644
--- a/accel/tcg/ldst_atomicity.c.inc
+++ b/accel/tcg/ldst_atomicity.c.inc
@@ -16,6 +16,23 @@
 #endif
 #define HAVE_al8_fast      (ATOMIC_REG_SIZE >= 8)
 
+/*
+ * If __alignof(unsigned __int128) < 16, GCC may refuse to inline atomics
+ * that are supported by the host, e.g. s390x.  We can force the pointer to
+ * have our known alignment with __builtin_assume_aligned, however prior to
+ * GCC 13 that was only reliable with optimization enabled.  See
+ *   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107389
+ */
+#if defined(CONFIG_ATOMIC128_OPT)
+# if !defined(__OPTIMIZE__)
+#  define ATTRIBUTE_ATOMIC128_OPT  __attribute__((optimize("O1")))
+# endif
+# define CONFIG_ATOMIC128
+#endif
+#ifndef ATTRIBUTE_ATOMIC128_OPT
+# define ATTRIBUTE_ATOMIC128_OPT
+#endif
+
 #if defined(CONFIG_ATOMIC128)
 # define HAVE_al16_fast    true
 #else
@@ -136,7 +153,8 @@ static inline uint64_t load_atomic8(void *pv)
  *
  * Atomically load 16 aligned bytes from @pv.
  */
-static inline Int128 load_atomic16(void *pv)
+static inline Int128 ATTRIBUTE_ATOMIC128_OPT
+load_atomic16(void *pv)
 {
 #ifdef CONFIG_ATOMIC128
     __uint128_t *p = __builtin_assume_aligned(pv, 16);
@@ -340,7 +358,8 @@ static uint64_t load_atom_extract_al16_or_exit(CPUArchState *env, uintptr_t ra,
  * cross an 16-byte boundary then the access must be 16-byte atomic,
  * otherwise the access must be 8-byte atomic.
  */
-static inline uint64_t load_atom_extract_al16_or_al8(void *pv, int s)
+static inline uint64_t ATTRIBUTE_ATOMIC128_OPT
+load_atom_extract_al16_or_al8(void *pv, int s)
 {
 #if defined(CONFIG_ATOMIC128)
     uintptr_t pi = (uintptr_t)pv;
@@ -676,28 +695,24 @@ static inline void store_atomic8(void *pv, uint64_t val)
  *
  * Atomically store 16 aligned bytes to @pv.
  */
-static inline void store_atomic16(void *pv, Int128 val)
+static inline void ATTRIBUTE_ATOMIC128_OPT
+store_atomic16(void *pv, Int128Alias val)
 {
 #if defined(CONFIG_ATOMIC128)
     __uint128_t *pu = __builtin_assume_aligned(pv, 16);
-    Int128Alias new;
-
-    new.s = val;
-    qatomic_set__nocheck(pu, new.u);
+    qatomic_set__nocheck(pu, val.u);
 #elif defined(CONFIG_CMPXCHG128)
     __uint128_t *pu = __builtin_assume_aligned(pv, 16);
     __uint128_t o;
-    Int128Alias n;
 
     /*
      * Without CONFIG_ATOMIC128, __atomic_compare_exchange_n will always
      * defer to libatomic, so we must use __sync_val_compare_and_swap_16
      * and accept the sequential consistency that comes with it.
      */
-    n.s = val;
     do {
         o = *pu;
-    } while (!__sync_bool_compare_and_swap_16(pu, o, n.u));
+    } while (!__sync_bool_compare_and_swap_16(pu, o, val.u));
 #else
     qemu_build_not_reached();
 #endif
@@ -779,7 +794,8 @@ static void store_atom_insert_al8(uint64_t *p, uint64_t val, uint64_t msk)
  *
  * Atomically store @val to @p masked by @msk.
  */
-static void store_atom_insert_al16(Int128 *ps, Int128Alias val, Int128Alias msk)
+static void ATTRIBUTE_ATOMIC128_OPT
+store_atom_insert_al16(Int128 *ps, Int128Alias val, Int128Alias msk)
 {
 #if defined(CONFIG_ATOMIC128)
     __uint128_t *pu, old, new;
diff --git a/meson.build b/meson.build
index 77d42898c8..4bbdbcef37 100644
--- a/meson.build
+++ b/meson.build
@@ -2241,23 +2241,21 @@ config_host_data.set('HAVE_BROKEN_SIZE_MAX', not cc.compiles('''
         return printf("%zu", SIZE_MAX);
     }''', args: ['-Werror']))
 
-atomic_test = '''
+# See if 64-bit atomic operations are supported.
+# Note that without __atomic builtins, we can only
+# assume atomic loads/stores max at pointer size.
+config_host_data.set('CONFIG_ATOMIC64', cc.links('''
   #include <stdint.h>
   int main(void)
   {
-    @0@ x = 0, y = 0;
+    uint64_t x = 0, y = 0;
     y = __atomic_load_n(&x, __ATOMIC_RELAXED);
     __atomic_store_n(&x, y, __ATOMIC_RELAXED);
     __atomic_compare_exchange_n(&x, &y, x, 0, __ATOMIC_RELAXED, __ATOMIC_RELAXED);
     __atomic_exchange_n(&x, y, __ATOMIC_RELAXED);
     __atomic_fetch_add(&x, y, __ATOMIC_RELAXED);
     return 0;
-  }'''
-
-# See if 64-bit atomic operations are supported.
-# Note that without __atomic builtins, we can only
-# assume atomic loads/stores max at pointer size.
-config_host_data.set('CONFIG_ATOMIC64', cc.links(atomic_test.format('uint64_t')))
+  }'''))
 
 has_int128 = cc.links('''
   __int128_t a;
@@ -2275,21 +2273,39 @@ if has_int128
   # "do we have 128-bit atomics which are handled inline and specifically not
   # via libatomic". The reason we can't use libatomic is documented in the
   # comment starting "GCC is a house divided" in include/qemu/atomic128.h.
-  has_atomic128 = cc.links(atomic_test.format('unsigned __int128'))
+  # We only care about these operations on 16-byte aligned pointers, so
+  # force 16-byte alignment of the pointer, which may be greater than
+  # __alignof(unsigned __int128) for the host.
+  atomic_test_128 = '''
+    int main(int ac, char **av) {
+      unsigned __int128 *p = __builtin_assume_aligned(av[ac - 1], 16);
+      p[1] = __atomic_load_n(&p[0], __ATOMIC_RELAXED);
+      __atomic_store_n(&p[2], p[3], __ATOMIC_RELAXED);
+      __atomic_compare_exchange_n(&p[4], &p[5], p[6], 0, __ATOMIC_RELAXED, __ATOMIC_RELAXED);
+      return 0;
+    }'''
+  has_atomic128 = cc.links(atomic_test_128)
 
   config_host_data.set('CONFIG_ATOMIC128', has_atomic128)
 
   if not has_atomic128
-    has_cmpxchg128 = cc.links('''
-      int main(void)
-      {
-        unsigned __int128 x = 0, y = 0;
-        __sync_val_compare_and_swap_16(&x, y, x);
-        return 0;
-      }
-    ''')
+    # Even with __builtin_assume_aligned, the above test may have failed
+    # without optimization enabled.  Try again with optimizations locally
+    # enabled for the function.  See
+    #   https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107389
+    has_atomic128_opt = cc.links('__attribute__((optimize("O1")))' + atomic_test_128)
+    config_host_data.set('CONFIG_ATOMIC128_OPT', has_atomic128_opt)
 
-    config_host_data.set('CONFIG_CMPXCHG128', has_cmpxchg128)
+    if not has_atomic128_opt
+      config_host_data.set('CONFIG_CMPXCHG128', cc.links('''
+        int main(void)
+        {
+          unsigned __int128 x = 0, y = 0;
+          __sync_val_compare_and_swap_16(&x, y, x);
+          return 0;
+        }
+      '''))
+    endif
   endif
 endif
 
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH v4 14/57] tcg/i386: Add have_atomic16
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
                   ` (12 preceding siblings ...)
  2023-05-03  7:06 ` [PATCH v4 13/57] meson: Detect atomic128 support with optimization Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-05 10:34   ` Peter Maydell
  2023-05-03  7:06 ` [PATCH v4 15/57] accel/tcg: Use have_atomic16 in ldst_atomicity.c.inc Richard Henderson
                   ` (43 subsequent siblings)
  57 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel; +Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

Notice when Intel or AMD have guaranteed that vmovdqa is atomic.
The new variable will also be used in generated code.
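
As a rough sketch of the check (simplified: the hunk below additionally
requires that AVX has already been detected as usable by the OS, which
this omits; __get_cpuid, bit_AVX and the signature constants come from
the compiler's <cpuid.h>):

    #include <cpuid.h>
    #include <stdbool.h>

    static bool detect_atomic16(void)
    {
        unsigned a, b, c, d;

        /* Leaf 1: ECX bit 28 advertises AVX. */
        if (!__get_cpuid(1, &a, &b, &c, &d) || !(c & bit_AVX)) {
            return false;
        }
        /* Leaf 0: only Intel and AMD document the 16-byte guarantee. */
        __get_cpuid(0, &a, &b, &c, &d);
        return c == signature_INTEL_ecx || c == signature_AMD_ecx;
    }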

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 include/qemu/cpuid.h      | 18 ++++++++++++++++++
 tcg/i386/tcg-target.h     |  1 +
 tcg/i386/tcg-target.c.inc | 27 +++++++++++++++++++++++++++
 3 files changed, 46 insertions(+)

diff --git a/include/qemu/cpuid.h b/include/qemu/cpuid.h
index 1451e8ef2f..35325f1995 100644
--- a/include/qemu/cpuid.h
+++ b/include/qemu/cpuid.h
@@ -71,6 +71,24 @@
 #define bit_LZCNT       (1 << 5)
 #endif
 
+/*
+ * Signatures for different CPU implementations as returned from Leaf 0.
+ */
+
+#ifndef signature_INTEL_ecx
+/* "Genu" "ineI" "ntel" */
+#define signature_INTEL_ebx     0x756e6547
+#define signature_INTEL_edx     0x49656e69
+#define signature_INTEL_ecx     0x6c65746e
+#endif
+
+#ifndef signature_AMD_ecx
+/* "Auth" "enti" "cAMD" */
+#define signature_AMD_ebx       0x68747541
+#define signature_AMD_edx       0x69746e65
+#define signature_AMD_ecx       0x444d4163
+#endif
+
 static inline unsigned xgetbv_low(unsigned c)
 {
     unsigned a, d;
diff --git a/tcg/i386/tcg-target.h b/tcg/i386/tcg-target.h
index d4f2a6f8c2..0421776cb8 100644
--- a/tcg/i386/tcg-target.h
+++ b/tcg/i386/tcg-target.h
@@ -120,6 +120,7 @@ extern bool have_avx512dq;
 extern bool have_avx512vbmi2;
 extern bool have_avx512vl;
 extern bool have_movbe;
+extern bool have_atomic16;
 
 /* optional instructions */
 #define TCG_TARGET_HAS_div2_i32         1
diff --git a/tcg/i386/tcg-target.c.inc b/tcg/i386/tcg-target.c.inc
index bb603e7968..f838683fc3 100644
--- a/tcg/i386/tcg-target.c.inc
+++ b/tcg/i386/tcg-target.c.inc
@@ -185,6 +185,7 @@ bool have_avx512dq;
 bool have_avx512vbmi2;
 bool have_avx512vl;
 bool have_movbe;
+bool have_atomic16;
 
 #ifdef CONFIG_CPUID_H
 static bool have_bmi2;
@@ -4024,6 +4025,32 @@ static void tcg_target_init(TCGContext *s)
                     have_avx512dq = (b7 & bit_AVX512DQ) != 0;
                     have_avx512vbmi2 = (c7 & bit_AVX512VBMI2) != 0;
                 }
+
+                /*
+                 * The Intel SDM has added:
+                 *   Processors that enumerate support for Intel® AVX
+                 *   (by setting the feature flag CPUID.01H:ECX.AVX[bit 28])
+                 *   guarantee that the 16-byte memory operations performed
+                 *   by the following instructions will always be carried
+                 *   out atomically:
+                 *   - MOVAPD, MOVAPS, and MOVDQA.
+                 *   - VMOVAPD, VMOVAPS, and VMOVDQA when encoded with VEX.128.
+                 *   - VMOVAPD, VMOVAPS, VMOVDQA32, and VMOVDQA64 when encoded
+                 *     with EVEX.128 and k0 (masking disabled).
+                 * Note that these instructions require the linear addresses
+                 * of their memory operands to be 16-byte aligned.
+                 *
+                 * AMD has provided an even stronger guarantee that processors
+                 * with AVX provide 16-byte atomicity for all cacheable,
+                 * naturally aligned single loads and stores, e.g. MOVDQU.
+                 *
+                 * See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688
+                 */
+                if (have_avx1) {
+                    __cpuid(0, a, b, c, d);
+                    have_atomic16 = (c == signature_INTEL_ecx ||
+                                     c == signature_AMD_ecx);
+                }
             }
         }
     }
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH v4 15/57] accel/tcg: Use have_atomic16 in ldst_atomicity.c.inc
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
                   ` (13 preceding siblings ...)
  2023-05-03  7:06 ` [PATCH v4 14/57] tcg/i386: Add have_atomic16 Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-05 10:37   ` Peter Maydell
  2023-05-03  7:06 ` [PATCH v4 16/57] accel/tcg: Add aarch64 specific support in ldst_atomicity Richard Henderson
                   ` (42 subsequent siblings)
  57 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel; +Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

Hosts using Intel and AMD AVX CPUs are quite common.
Add fast paths through ldst_atomicity using this.

Only enable with CONFIG_INT128; some older clang versions do not
support __int128_t, and the inline assembly won't work on structures.
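
A minimal sketch of the shape of the fast path (assumes an x86-64 host on
which have_atomic16 was detected; illustrative only, the real code goes
through Int128Alias and the HAVE_al16_fast guard):

    /* 16-byte aligned load; atomic only under the AVX vendor guarantee. */
    static inline unsigned __int128 load16_avx(const void *pv)
    {
        unsigned __int128 r;

        asm("vmovdqa %1, %0" : "=x"(r) : "m"(*(const unsigned __int128 *)pv));
        return r;
    }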

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 accel/tcg/ldst_atomicity.c.inc | 76 +++++++++++++++++++++++++++-------
 1 file changed, 60 insertions(+), 16 deletions(-)

diff --git a/accel/tcg/ldst_atomicity.c.inc b/accel/tcg/ldst_atomicity.c.inc
index c43f101ebe..07bfa5c3c8 100644
--- a/accel/tcg/ldst_atomicity.c.inc
+++ b/accel/tcg/ldst_atomicity.c.inc
@@ -35,6 +35,14 @@
 
 #if defined(CONFIG_ATOMIC128)
 # define HAVE_al16_fast    true
+#elif defined(CONFIG_TCG_INTERPRETER)
+/*
+ * FIXME: host specific detection for this is in tcg/$host/,
+ * but we're using tcg/tci/ instead.
+ */
+# define HAVE_al16_fast    false
+#elif defined(__x86_64__) && defined(CONFIG_INT128)
+# define HAVE_al16_fast    likely(have_atomic16)
 #else
 # define HAVE_al16_fast    false
 #endif
@@ -162,6 +170,12 @@ load_atomic16(void *pv)
 
     r.u = qatomic_read__nocheck(p);
     return r.s;
+#elif defined(__x86_64__) && defined(CONFIG_INT128)
+    Int128Alias r;
+
+    /* Via HAVE_al16_fast, have_atomic16 is true. */
+    asm("vmovdqa %1, %0" : "=x" (r.u) : "m" (*(Int128 *)pv));
+    return r.s;
 #else
     qemu_build_not_reached();
 #endif
@@ -383,6 +397,24 @@ load_atom_extract_al16_or_al8(void *pv, int s)
         r = qatomic_read__nocheck(p16);
     }
     return r >> shr;
+#elif defined(__x86_64__) && defined(CONFIG_INT128)
+    uintptr_t pi = (uintptr_t)pv;
+    int shr = (pi & 7) * 8;
+    uint64_t a, b;
+
+    /* Via HAVE_al16_fast, have_atomic16 is true. */
+    pv = (void *)(pi & ~7);
+    if (pi & 8) {
+        uint64_t *p8 = __builtin_assume_aligned(pv, 16, 8);
+        a = qatomic_read__nocheck(p8);
+        b = qatomic_read__nocheck(p8 + 1);
+    } else {
+        asm("vmovdqa %2, %0\n\tvpextrq $1, %0, %1"
+            : "=x"(a), "=r"(b) : "m" (*(__uint128_t *)pv));
+    }
+    asm("shrd %b2, %1, %0" : "+r"(a) : "r"(b), "c"(shr));
+
+    return a;
 #else
     qemu_build_not_reached();
 #endif
@@ -699,23 +731,35 @@ static inline void ATTRIBUTE_ATOMIC128_OPT
 store_atomic16(void *pv, Int128Alias val)
 {
 #if defined(CONFIG_ATOMIC128)
-    __uint128_t *pu = __builtin_assume_aligned(pv, 16);
-    qatomic_set__nocheck(pu, val.u);
-#elif defined(CONFIG_CMPXCHG128)
-    __uint128_t *pu = __builtin_assume_aligned(pv, 16);
-    __uint128_t o;
-
-    /*
-     * Without CONFIG_ATOMIC128, __atomic_compare_exchange_n will always
-     * defer to libatomic, so we must use __sync_val_compare_and_swap_16
-     * and accept the sequential consistency that comes with it.
-     */
-    do {
-        o = *pu;
-    } while (!__sync_bool_compare_and_swap_16(pu, o, val.u));
-#else
-    qemu_build_not_reached();
+    {
+        __uint128_t *pu = __builtin_assume_aligned(pv, 16);
+        qatomic_set__nocheck(pu, val.u);
+        return;
+    }
 #endif
+#if defined(__x86_64__) && defined(CONFIG_INT128)
+    if (HAVE_al16_fast) {
+        asm("vmovdqa %1, %0" : "=m"(*(__uint128_t *)pv) : "x" (val.u));
+        return;
+    }
+#endif
+#if defined(CONFIG_CMPXCHG128)
+    {
+        __uint128_t *pu = __builtin_assume_aligned(pv, 16);
+        __uint128_t o;
+
+        /*
+         * Without CONFIG_ATOMIC128, __atomic_compare_exchange_n will always
+         * defer to libatomic, so we must use __sync_val_compare_and_swap_16
+         * and accept the sequential consistency that comes with it.
+         */
+        do {
+            o = *pu;
+        } while (!__sync_bool_compare_and_swap_16(pu, o, val.u));
+        return;
+    }
+#endif
+    qemu_build_not_reached();
 }
 
 /**
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH v4 16/57] accel/tcg: Add aarch64 specific support in ldst_atomicity
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
                   ` (14 preceding siblings ...)
  2023-05-03  7:06 ` [PATCH v4 15/57] accel/tcg: Use have_atomic16 in ldst_atomicity.c.inc Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-03  7:06 ` [PATCH v4 17/57] tcg/aarch64: Detect have_lse, have_lse2 for linux Richard Henderson
                   ` (41 subsequent siblings)
  57 siblings, 0 replies; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel; +Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

We have code in atomic128.h noting that through GCC 8, there
was no support for atomic operations on __uint128.  This has
been fixed in GCC 10.  But we can still improve over any
basic compare-and-swap loop using the ldxp/stxp instructions.
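
The idiom relied on below, as a self-contained sketch (little-endian
AArch64 assumed, as the patch itself asserts):

    #include <stdint.h>

    /*
     * Read 16 bytes without tearing: storing the just-loaded pair back
     * with stxp is what guarantees the ldxp observed a single copy.
     */
    static inline unsigned __int128 load16_ldxp(void *pv)
    {
        uint64_t l, h;
        uint32_t fail;

        asm("0: ldxp %0, %1, %3\n\t"
            "stxp %w2, %0, %1, %3\n\t"
            "cbnz %w2, 0b"
            : "=&r"(l), "=&r"(h), "=&r"(fail)
            : "Q"(*(unsigned __int128 *)pv));
        return ((unsigned __int128)h << 64) | l;
    }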

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 accel/tcg/ldst_atomicity.c.inc | 60 ++++++++++++++++++++++++++++++++--
 1 file changed, 57 insertions(+), 3 deletions(-)

diff --git a/accel/tcg/ldst_atomicity.c.inc b/accel/tcg/ldst_atomicity.c.inc
index 07bfa5c3c8..2426b09aef 100644
--- a/accel/tcg/ldst_atomicity.c.inc
+++ b/accel/tcg/ldst_atomicity.c.inc
@@ -247,7 +247,22 @@ static Int128 load_atomic16_or_exit(CPUArchState *env, uintptr_t ra, void *pv)
      * In system mode all guest pages are writable, and for user-only
      * we have just checked writability.  Try cmpxchg.
      */
-#if defined(CONFIG_CMPXCHG128)
+#if defined(__aarch64__)
+    /* We can do better than cmpxchg for AArch64.  */
+    {
+        uint64_t l, h;
+        uint32_t fail;
+
+        /* The load must be paired with the store to guarantee not tearing. */
+        asm("0: ldxp %0, %1, %3\n\t"
+            "stxp %w2, %0, %1, %3\n\t"
+            "cbnz %w2, 0b"
+            : "=&r"(l), "=&r"(h), "=&r"(fail) : "Q"(*p));
+
+        qemu_build_assert(!HOST_BIG_ENDIAN);
+        return int128_make128(l, h);
+    }
+#elif defined(CONFIG_CMPXCHG128)
     /* Swap 0 with 0, with the side-effect of returning the old value. */
     {
         Int128Alias r;
@@ -743,7 +758,22 @@ store_atomic16(void *pv, Int128Alias val)
         return;
     }
 #endif
-#if defined(CONFIG_CMPXCHG128)
+#if defined(__aarch64__)
+    /* We can do better than cmpxchg for AArch64.  */
+    {
+        uint64_t l, h, t;
+
+        qemu_build_assert(!HOST_BIG_ENDIAN);
+        l = int128_getlo(val.s);
+        h = int128_gethi(val.s);
+
+        asm("0: ldxp %0, xzr, %1\n\t"
+            "stxp %w0, %2, %3, %1\n\t"
+            "cbnz %w0, 0b"
+            : "=&r"(t), "=Q"(*(__uint128_t *)pv) : "r"(l), "r"(h));
+        return;
+    }
+#elif defined(CONFIG_CMPXCHG128)
     {
         __uint128_t *pu = __builtin_assume_aligned(pv, 16);
         __uint128_t o;
@@ -841,7 +871,31 @@ static void store_atom_insert_al8(uint64_t *p, uint64_t val, uint64_t msk)
 static void ATTRIBUTE_ATOMIC128_OPT
 store_atom_insert_al16(Int128 *ps, Int128Alias val, Int128Alias msk)
 {
-#if defined(CONFIG_ATOMIC128)
+#if defined(__aarch64__)
+    /*
+     * GCC only implements __sync* primitives for int128 on aarch64.
+     * We can do better without the barriers, by integrating the
+     * arithmetic into the load-exclusive/store-conditional pair.
+     */
+    uint64_t tl, th, vl, vh, ml, mh;
+    uint32_t fail;
+
+    qemu_build_assert(!HOST_BIG_ENDIAN);
+    vl = int128_getlo(val.s);
+    vh = int128_gethi(val.s);
+    ml = int128_getlo(msk.s);
+    mh = int128_gethi(msk.s);
+
+    asm("0: ldxp %[l], %[h], %[mem]\n\t"
+        "bic %[l], %[l], %[ml]\n\t"
+        "bic %[h], %[h], %[mh]\n\t"
+        "orr %[l], %[l], %[vl]\n\t"
+        "orr %[h], %[h], %[vh]\n\t"
+        "stxp %w[f], %[l], %[h], %[mem]\n\t"
+        "cbnz %w[f], 0b\n"
+        : [mem] "+Q"(*ps), [f] "=&r"(fail), [l] "=&r"(tl), [h] "=&r"(th)
+        : [vl] "r"(vl), [vh] "r"(vh), [ml] "r"(ml), [mh] "r"(mh));
+#elif defined(CONFIG_ATOMIC128)
     __uint128_t *pu, old, new;
 
     /* With CONFIG_ATOMIC128, we can avoid the memory barriers. */
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH v4 17/57] tcg/aarch64: Detect have_lse, have_lse2 for linux
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
                   ` (15 preceding siblings ...)
  2023-05-03  7:06 ` [PATCH v4 16/57] accel/tcg: Add aarch64 specific support in ldst_atomicity Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-05 10:41   ` Peter Maydell
  2023-05-03  7:06 ` [PATCH v4 18/57] tcg/aarch64: Detect have_lse, have_lse2 for darwin Richard Henderson
                   ` (40 subsequent siblings)
  57 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel; +Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

Notice when the host has additional atomic instructions.
The new variables will also be used in generated code.
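
The detection itself is tiny; as a stand-alone sketch (aarch64 Linux host
assumed, and HWCAP_USCAT needs reasonably recent kernel headers):

    #include <stdbool.h>
    #include <sys/auxv.h>
    #include <asm/hwcap.h>

    static bool have_lse, have_lse2;

    static void detect_lse(void)
    {
        unsigned long hwcap = getauxval(AT_HWCAP);

        have_lse  = hwcap & HWCAP_ATOMICS;   /* FEAT_LSE  */
        have_lse2 = hwcap & HWCAP_USCAT;     /* FEAT_LSE2 */
    }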

Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/aarch64/tcg-target.h     |  3 +++
 tcg/aarch64/tcg-target.c.inc | 12 ++++++++++++
 2 files changed, 15 insertions(+)

diff --git a/tcg/aarch64/tcg-target.h b/tcg/aarch64/tcg-target.h
index c0b0f614ba..3c0b0d312d 100644
--- a/tcg/aarch64/tcg-target.h
+++ b/tcg/aarch64/tcg-target.h
@@ -57,6 +57,9 @@ typedef enum {
 #define TCG_TARGET_CALL_ARG_I128        TCG_CALL_ARG_EVEN
 #define TCG_TARGET_CALL_RET_I128        TCG_CALL_RET_NORMAL
 
+extern bool have_lse;
+extern bool have_lse2;
+
 /* optional instructions */
 #define TCG_TARGET_HAS_div_i32          1
 #define TCG_TARGET_HAS_rem_i32          1
diff --git a/tcg/aarch64/tcg-target.c.inc b/tcg/aarch64/tcg-target.c.inc
index e6636c1f8b..fc551a3d10 100644
--- a/tcg/aarch64/tcg-target.c.inc
+++ b/tcg/aarch64/tcg-target.c.inc
@@ -13,6 +13,9 @@
 #include "../tcg-ldst.c.inc"
 #include "../tcg-pool.c.inc"
 #include "qemu/bitops.h"
+#ifdef __linux__
+#include <asm/hwcap.h>
+#endif
 
 /* We're going to re-use TCGType in setting of the SF bit, which controls
    the size of the operation performed.  If we know the values match, it
@@ -71,6 +74,9 @@ static TCGReg tcg_target_call_oarg_reg(TCGCallReturnKind kind, int slot)
     return TCG_REG_X0 + slot;
 }
 
+bool have_lse;
+bool have_lse2;
+
 #define TCG_REG_TMP TCG_REG_X30
 #define TCG_VEC_TMP TCG_REG_V31
 
@@ -2899,6 +2905,12 @@ static TCGConstraintSetIndex tcg_target_op_def(TCGOpcode op)
 
 static void tcg_target_init(TCGContext *s)
 {
+#ifdef __linux__
+    unsigned long hwcap = qemu_getauxval(AT_HWCAP);
+    have_lse = hwcap & HWCAP_ATOMICS;
+    have_lse2 = hwcap & HWCAP_USCAT;
+#endif
+
     tcg_target_available_regs[TCG_TYPE_I32] = 0xffffffffu;
     tcg_target_available_regs[TCG_TYPE_I64] = 0xffffffffu;
     tcg_target_available_regs[TCG_TYPE_V64] = 0xffffffff00000000ull;
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH v4 18/57] tcg/aarch64: Detect have_lse, have_lse2 for darwin
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
                   ` (16 preceding siblings ...)
  2023-05-03  7:06 ` [PATCH v4 17/57] tcg/aarch64: Detect have_lse, have_lse2 for linux Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-05 10:43   ` Peter Maydell
  2023-05-03  7:06 ` [PATCH v4 19/57] accel/tcg: Add have_lse2 support in ldst_atomicity Richard Henderson
                   ` (39 subsequent siblings)
  57 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel; +Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

These features are present for Apple M1.
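
For reference, a minimal sketch of the macOS side of the detection (the
sysctl names are the ones used in the hunk below):

    #include <stdbool.h>
    #include <sys/sysctl.h>

    static bool darwin_feature(const char *name)
    {
        int val = 0;
        size_t len = sizeof(val);

        return sysctlbyname(name, &val, &len, NULL, 0) == 0 && val != 0;
    }

    /* e.g. darwin_feature("hw.optional.arm.FEAT_LSE2") is true on Apple M1 */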

Tested-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/aarch64/tcg-target.c.inc | 28 ++++++++++++++++++++++++++++
 1 file changed, 28 insertions(+)

diff --git a/tcg/aarch64/tcg-target.c.inc b/tcg/aarch64/tcg-target.c.inc
index fc551a3d10..3adc5fd3a3 100644
--- a/tcg/aarch64/tcg-target.c.inc
+++ b/tcg/aarch64/tcg-target.c.inc
@@ -16,6 +16,9 @@
 #ifdef __linux__
 #include <asm/hwcap.h>
 #endif
+#ifdef CONFIG_DARWIN
+#include <sys/sysctl.h>
+#endif
 
 /* We're going to re-use TCGType in setting of the SF bit, which controls
    the size of the operation performed.  If we know the values match, it
@@ -2903,6 +2906,27 @@ static TCGConstraintSetIndex tcg_target_op_def(TCGOpcode op)
     }
 }
 
+#ifdef CONFIG_DARWIN
+static bool sysctl_for_bool(const char *name)
+{
+    int val = 0;
+    size_t len = sizeof(val);
+
+    if (sysctlbyname(name, &val, &len, NULL, 0) == 0) {
+        return val != 0;
+    }
+
+    /*
+     * We might in the future ask for properties not present in older
+     * kernels, but we're only asking about static properties, all of
+     * which should be 'int'.  So we shouldn't see ENOMEM (val too small),
+     * or any of the other more exotic errors.
+     */
+    assert(errno == ENOENT);
+    return false;
+}
+#endif
+
 static void tcg_target_init(TCGContext *s)
 {
 #ifdef __linux__
@@ -2910,6 +2934,10 @@ static void tcg_target_init(TCGContext *s)
     have_lse = hwcap & HWCAP_ATOMICS;
     have_lse2 = hwcap & HWCAP_USCAT;
 #endif
+#ifdef CONFIG_DARWIN
+    have_lse = sysctl_for_bool("hw.optional.arm.FEAT_LSE");
+    have_lse2 = sysctl_for_bool("hw.optional.arm.FEAT_LSE2");
+#endif
 
     tcg_target_available_regs[TCG_TYPE_I32] = 0xffffffffu;
     tcg_target_available_regs[TCG_TYPE_I64] = 0xffffffffu;
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH v4 19/57] accel/tcg: Add have_lse2 support in ldst_atomicity
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
                   ` (17 preceding siblings ...)
  2023-05-03  7:06 ` [PATCH v4 18/57] tcg/aarch64: Detect have_lse, have_lse2 for darwin Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-03  7:06 ` [PATCH v4 20/57] tcg: Introduce TCG_OPF_TYPE_MASK Richard Henderson
                   ` (38 subsequent siblings)
  57 siblings, 0 replies; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel; +Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

Add fast paths for FEAT_LSE2, using the detection in tcg.
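
With FEAT_LSE2 a plain LDP/STP on a 16-byte-aligned address is single-copy
atomic, so the aligned fast path reduces to something like this sketch
(little-endian host; only valid once have_lse2 has been detected):

    #include <stdint.h>

    static inline unsigned __int128 load16_lse2(void *pv)
    {
        uint64_t l, h;

        asm("ldp %0, %1, %2" : "=r"(l), "=r"(h)
            : "m"(*(unsigned __int128 *)pv));
        return ((unsigned __int128)h << 64) | l;
    }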

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 accel/tcg/ldst_atomicity.c.inc | 37 ++++++++++++++++++++++++++++++----
 1 file changed, 33 insertions(+), 4 deletions(-)

diff --git a/accel/tcg/ldst_atomicity.c.inc b/accel/tcg/ldst_atomicity.c.inc
index 2426b09aef..7ed5d4282d 100644
--- a/accel/tcg/ldst_atomicity.c.inc
+++ b/accel/tcg/ldst_atomicity.c.inc
@@ -41,6 +41,8 @@
  * but we're using tcg/tci/ instead.
  */
 # define HAVE_al16_fast    false
+#elif defined(__aarch64__)
+# define HAVE_al16_fast    likely(have_lse2)
 #elif defined(__x86_64__) && defined(CONFIG_INT128)
 # define HAVE_al16_fast    likely(have_atomic16)
 #else
@@ -48,6 +50,8 @@
 #endif
 #if defined(CONFIG_ATOMIC128) || defined(CONFIG_CMPXCHG128)
 # define HAVE_al16         true
+#elif defined(__aarch64__)
+# define HAVE_al16         true
 #else
 # define HAVE_al16         false
 #endif
@@ -170,6 +174,14 @@ load_atomic16(void *pv)
 
     r.u = qatomic_read__nocheck(p);
     return r.s;
+#elif defined(__aarch64__)
+    uint64_t l, h;
+
+    /* Via HAVE_al16_fast, FEAT_LSE2 is present: LDP becomes atomic. */
+    asm("ldp %0, %1, %2" : "=r"(l), "=r"(h) : "m"(*(__uint128_t *)pv));
+
+    qemu_build_assert(!HOST_BIG_ENDIAN);
+    return int128_make128(l, h);
 #elif defined(__x86_64__) && defined(CONFIG_INT128)
     Int128Alias r;
 
@@ -412,6 +424,18 @@ load_atom_extract_al16_or_al8(void *pv, int s)
         r = qatomic_read__nocheck(p16);
     }
     return r >> shr;
+#elif defined(__aarch64__)
+    /*
+     * Via HAVE_al16_fast, FEAT_LSE2 is present.
+     * LDP becomes single-copy atomic if 16-byte aligned, and
+     * single-copy atomic on the parts if 8-byte aligned.
+     */
+    uintptr_t pi = (uintptr_t)pv;
+    int shr = (pi & 7) * 8;
+    uint64_t l, h;
+
+    asm("ldp %0, %1, %2" : "=r"(l), "=r"(h) : "m"(*(__uint128_t *)(pi & ~7)));
+    return (l >> shr) | (h << (-shr & 63));
 #elif defined(__x86_64__) && defined(CONFIG_INT128)
     uintptr_t pi = (uintptr_t)pv;
     int shr = (pi & 7) * 8;
@@ -767,10 +791,15 @@ store_atomic16(void *pv, Int128Alias val)
         l = int128_getlo(val.s);
         h = int128_gethi(val.s);
 
-        asm("0: ldxp %0, xzr, %1\n\t"
-            "stxp %w0, %2, %3, %1\n\t"
-            "cbnz %w0, 0b"
-            : "=&r"(t), "=Q"(*(__uint128_t *)pv) : "r"(l), "r"(h));
+        if (HAVE_al16_fast) {
+            /* Via HAVE_al16_fast, FEAT_LSE2 is present: STP becomes atomic. */
+            asm("stp %1, %2, %0" : "=Q"(*(__uint128_t *)pv) : "r"(l), "r"(h));
+        } else {
+            asm("0: ldxp %0, xzr, %1\n\t"
+                "stxp %w0, %2, %3, %1\n\t"
+                "cbnz %w0, 0b"
+                : "=&r"(t), "=Q"(*(__uint128_t *)pv) : "r"(l), "r"(h));
+        }
         return;
     }
 #elif defined(CONFIG_CMPXCHG128)
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH v4 20/57] tcg: Introduce TCG_OPF_TYPE_MASK
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
                   ` (18 preceding siblings ...)
  2023-05-03  7:06 ` [PATCH v4 19/57] accel/tcg: Add have_lse2 support in ldst_atomicity Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-05 10:45   ` Peter Maydell
  2023-05-03  7:06 ` [PATCH v4 21/57] tcg/i386: Use full load/store helpers in user-only mode Richard Henderson
                   ` (37 subsequent siblings)
  57 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel; +Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

Reorg TCG_OPF_64BIT and TCG_OPF_VECTOR into a two-bit field so
that we can add TCG_OPF_128BIT without requiring another bit.

Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 include/tcg/tcg.h            | 22 ++++++++++++----------
 tcg/optimize.c               | 15 ++++++++++++---
 tcg/tcg.c                    |  4 ++--
 tcg/aarch64/tcg-target.c.inc |  8 +++++---
 tcg/tci/tcg-target.c.inc     |  3 ++-
 5 files changed, 33 insertions(+), 19 deletions(-)

diff --git a/include/tcg/tcg.h b/include/tcg/tcg.h
index b19e167e1d..efbd891f87 100644
--- a/include/tcg/tcg.h
+++ b/include/tcg/tcg.h
@@ -932,24 +932,26 @@ typedef struct TCGArgConstraint {
 
 /* Bits for TCGOpDef->flags, 8 bits available, all used.  */
 enum {
+    /* Two bits describing the output type. */
+    TCG_OPF_TYPE_MASK    = 0x03,
+    TCG_OPF_32BIT        = 0x00,
+    TCG_OPF_64BIT        = 0x01,
+    TCG_OPF_VECTOR       = 0x02,
+    TCG_OPF_128BIT       = 0x03,
     /* Instruction exits the translation block.  */
-    TCG_OPF_BB_EXIT      = 0x01,
+    TCG_OPF_BB_EXIT      = 0x04,
     /* Instruction defines the end of a basic block.  */
-    TCG_OPF_BB_END       = 0x02,
+    TCG_OPF_BB_END       = 0x08,
     /* Instruction clobbers call registers and potentially update globals.  */
-    TCG_OPF_CALL_CLOBBER = 0x04,
+    TCG_OPF_CALL_CLOBBER = 0x10,
     /* Instruction has side effects: it cannot be removed if its outputs
        are not used, and might trigger exceptions.  */
-    TCG_OPF_SIDE_EFFECTS = 0x08,
-    /* Instruction operands are 64-bits (otherwise 32-bits).  */
-    TCG_OPF_64BIT        = 0x10,
+    TCG_OPF_SIDE_EFFECTS = 0x20,
     /* Instruction is optional and not implemented by the host, or insn
        is generic and should not be implemened by the host.  */
-    TCG_OPF_NOT_PRESENT  = 0x20,
-    /* Instruction operands are vectors.  */
-    TCG_OPF_VECTOR       = 0x40,
+    TCG_OPF_NOT_PRESENT  = 0x40,
     /* Instruction is a conditional branch. */
-    TCG_OPF_COND_BRANCH  = 0x80
+    TCG_OPF_COND_BRANCH  = 0x80,
 };
 
 typedef struct TCGOpDef {
diff --git a/tcg/optimize.c b/tcg/optimize.c
index 9614fa3638..37d46f2a1f 100644
--- a/tcg/optimize.c
+++ b/tcg/optimize.c
@@ -2051,12 +2051,21 @@ void tcg_optimize(TCGContext *s)
         copy_propagate(&ctx, op, def->nb_oargs, def->nb_iargs);
 
         /* Pre-compute the type of the operation. */
-        if (def->flags & TCG_OPF_VECTOR) {
+        switch (def->flags & TCG_OPF_TYPE_MASK) {
+        case TCG_OPF_VECTOR:
             ctx.type = TCG_TYPE_V64 + TCGOP_VECL(op);
-        } else if (def->flags & TCG_OPF_64BIT) {
+            break;
+        case TCG_OPF_128BIT:
+            ctx.type = TCG_TYPE_I128;
+            break;
+        case TCG_OPF_64BIT:
             ctx.type = TCG_TYPE_I64;
-        } else {
+            break;
+        case TCG_OPF_32BIT:
             ctx.type = TCG_TYPE_I32;
+            break;
+        default:
+            qemu_build_not_reached();
         }
 
         /* Assume all bits affected, no bits known zero, no sign reps. */
diff --git a/tcg/tcg.c b/tcg/tcg.c
index d0afabf194..cb5ca9b612 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -2294,7 +2294,7 @@ static void tcg_dump_ops(TCGContext *s, FILE *f, bool have_prefs)
             nb_iargs = def->nb_iargs;
             nb_cargs = def->nb_cargs;
 
-            if (def->flags & TCG_OPF_VECTOR) {
+            if ((def->flags & TCG_OPF_TYPE_MASK) == TCG_OPF_VECTOR) {
                 col += ne_fprintf(f, "v%d,e%d,", 64 << TCGOP_VECL(op),
                                   8 << TCGOP_VECE(op));
             }
@@ -4782,7 +4782,7 @@ static void tcg_reg_alloc_op(TCGContext *s, const TCGOp *op)
         tcg_out_extrl_i64_i32(s, new_args[0], new_args[1]);
         break;
     default:
-        if (def->flags & TCG_OPF_VECTOR) {
+        if ((def->flags & TCG_OPF_TYPE_MASK) == TCG_OPF_VECTOR) {
             tcg_out_vec_op(s, op->opc, TCGOP_VECL(op), TCGOP_VECE(op),
                            new_args, const_args);
         } else {
diff --git a/tcg/aarch64/tcg-target.c.inc b/tcg/aarch64/tcg-target.c.inc
index 3adc5fd3a3..43acb4fbcb 100644
--- a/tcg/aarch64/tcg-target.c.inc
+++ b/tcg/aarch64/tcg-target.c.inc
@@ -1921,9 +1921,11 @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc,
                        const TCGArg args[TCG_MAX_OP_ARGS],
                        const int const_args[TCG_MAX_OP_ARGS])
 {
-    /* 99% of the time, we can signal the use of extension registers
-       by looking to see if the opcode handles 64-bit data.  */
-    TCGType ext = (tcg_op_defs[opc].flags & TCG_OPF_64BIT) != 0;
+    /*
+     * 99% of the time, we can signal the use of extension registers
+     * by looking to see if the opcode handles 32-bit data or not.
+     */
+    TCGType ext = (tcg_op_defs[opc].flags & TCG_OPF_TYPE_MASK) != TCG_OPF_32BIT;
 
     /* Hoist the loads of the most common arguments.  */
     TCGArg a0 = args[0];
diff --git a/tcg/tci/tcg-target.c.inc b/tcg/tci/tcg-target.c.inc
index 4cf03a579c..e31640d109 100644
--- a/tcg/tci/tcg-target.c.inc
+++ b/tcg/tci/tcg-target.c.inc
@@ -790,7 +790,8 @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc,
     CASE_32_64(sextract) /* Optional (TCG_TARGET_HAS_sextract_*). */
         {
             TCGArg pos = args[2], len = args[3];
-            TCGArg max = tcg_op_defs[opc].flags & TCG_OPF_64BIT ? 64 : 32;
+            TCGArg max = ((tcg_op_defs[opc].flags & TCG_OPF_TYPE_MASK)
+                          == TCG_OPF_32BIT ? 32 : 64);
 
             tcg_debug_assert(pos < max);
             tcg_debug_assert(pos + len <= max);
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH v4 21/57] tcg/i386: Use full load/store helpers in user-only mode
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
                   ` (19 preceding siblings ...)
  2023-05-03  7:06 ` [PATCH v4 20/57] tcg: Introduce TCG_OPF_TYPE_MASK Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-05 12:01   ` Peter Maydell
  2023-05-03  7:06 ` [PATCH v4 22/57] tcg/aarch64: " Richard Henderson
                   ` (36 subsequent siblings)
  57 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel; +Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

Instead of using helper_unaligned_{ld,st}, use the full load/store helpers.
This will allow the fast path to increase alignment to implement atomicity
while not immediately raising an alignment exception.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/i386/tcg-target.c.inc | 52 +++------------------------------------
 1 file changed, 4 insertions(+), 48 deletions(-)

diff --git a/tcg/i386/tcg-target.c.inc b/tcg/i386/tcg-target.c.inc
index f838683fc3..e78d4d4aa7 100644
--- a/tcg/i386/tcg-target.c.inc
+++ b/tcg/i386/tcg-target.c.inc
@@ -1776,7 +1776,6 @@ typedef struct {
     int seg;
 } HostAddress;
 
-#if defined(CONFIG_SOFTMMU)
 /*
  * Because i686 has no register parameters and because x86_64 has xchg
  * to handle addr/data register overlap, we have placed all input arguments
@@ -1812,7 +1811,7 @@ static bool tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
 
     /* resolve label address */
     tcg_patch32(label_ptr[0], s->code_ptr - label_ptr[0] - 4);
-    if (TARGET_LONG_BITS > TCG_TARGET_REG_BITS) {
+    if (label_ptr[1]) {
         tcg_patch32(label_ptr[1], s->code_ptr - label_ptr[1] - 4);
     }
 
@@ -1834,7 +1833,7 @@ static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
 
     /* resolve label address */
     tcg_patch32(label_ptr[0], s->code_ptr - label_ptr[0] - 4);
-    if (TARGET_LONG_BITS > TCG_TARGET_REG_BITS) {
+    if (label_ptr[1]) {
         tcg_patch32(label_ptr[1], s->code_ptr - label_ptr[1] - 4);
     }
 
@@ -1844,51 +1843,8 @@ static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
     tcg_out_jmp(s, l->raddr);
     return true;
 }
-#else
-static bool tcg_out_fail_alignment(TCGContext *s, TCGLabelQemuLdst *l)
-{
-    /* resolve label address */
-    tcg_patch32(l->label_ptr[0], s->code_ptr - l->label_ptr[0] - 4);
-
-    if (TCG_TARGET_REG_BITS == 32) {
-        int ofs = 0;
-
-        tcg_out_st(s, TCG_TYPE_PTR, TCG_AREG0, TCG_REG_ESP, ofs);
-        ofs += 4;
-
-        tcg_out_st(s, TCG_TYPE_I32, l->addrlo_reg, TCG_REG_ESP, ofs);
-        ofs += 4;
-        if (TARGET_LONG_BITS == 64) {
-            tcg_out_st(s, TCG_TYPE_I32, l->addrhi_reg, TCG_REG_ESP, ofs);
-            ofs += 4;
-        }
-
-        tcg_out_pushi(s, (uintptr_t)l->raddr);
-    } else {
-        tcg_out_mov(s, TCG_TYPE_TL, tcg_target_call_iarg_regs[1],
-                    l->addrlo_reg);
-        tcg_out_mov(s, TCG_TYPE_PTR, tcg_target_call_iarg_regs[0], TCG_AREG0);
-
-        tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_RAX, (uintptr_t)l->raddr);
-        tcg_out_push(s, TCG_REG_RAX);
-    }
-
-    /* "Tail call" to the helper, with the return address back inline. */
-    tcg_out_jmp(s, (const void *)(l->is_ld ? helper_unaligned_ld
-                                  : helper_unaligned_st));
-    return true;
-}
-
-static bool tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
-{
-    return tcg_out_fail_alignment(s, l);
-}
-
-static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
-{
-    return tcg_out_fail_alignment(s, l);
-}
 
+#ifndef CONFIG_SOFTMMU
 static HostAddress x86_guest_base = {
     .index = -1
 };
@@ -1920,7 +1876,7 @@ static inline int setup_guest_base_seg(void)
     return 0;
 }
 #endif /* setup_guest_base_seg */
-#endif /* SOFTMMU */
+#endif /* !SOFTMMU */
 
 /*
  * For softmmu, perform the TLB load and compare.
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH v4 22/57] tcg/aarch64: Use full load/store helpers in user-only mode
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
                   ` (20 preceding siblings ...)
  2023-05-03  7:06 ` [PATCH v4 21/57] tcg/i386: Use full load/store helpers in user-only mode Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-05 12:06   ` Peter Maydell
  2023-05-03  7:06 ` [PATCH v4 23/57] tcg/ppc: " Richard Henderson
                   ` (35 subsequent siblings)
  57 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel; +Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

Instead of using helper_unaligned_{ld,st}, use the full load/store helpers.
This will allow the fast path to increase alignment to implement atomicity
while not immediately raising an alignment exception.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/aarch64/tcg-target.c.inc | 35 -----------------------------------
 1 file changed, 35 deletions(-)

diff --git a/tcg/aarch64/tcg-target.c.inc b/tcg/aarch64/tcg-target.c.inc
index 43acb4fbcb..09c9ecad0f 100644
--- a/tcg/aarch64/tcg-target.c.inc
+++ b/tcg/aarch64/tcg-target.c.inc
@@ -1595,7 +1595,6 @@ typedef struct {
     TCGType index_ext;
 } HostAddress;
 
-#ifdef CONFIG_SOFTMMU
 static const TCGLdstHelperParam ldst_helper_param = {
     .ntmp = 1, .tmp = { TCG_REG_TMP }
 };
@@ -1628,40 +1627,6 @@ static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *lb)
     tcg_out_goto(s, lb->raddr);
     return true;
 }
-#else
-static void tcg_out_adr(TCGContext *s, TCGReg rd, const void *target)
-{
-    ptrdiff_t offset = tcg_pcrel_diff(s, target);
-    tcg_debug_assert(offset == sextract64(offset, 0, 21));
-    tcg_out_insn(s, 3406, ADR, rd, offset);
-}
-
-static bool tcg_out_fail_alignment(TCGContext *s, TCGLabelQemuLdst *l)
-{
-    if (!reloc_pc19(l->label_ptr[0], tcg_splitwx_to_rx(s->code_ptr))) {
-        return false;
-    }
-
-    tcg_out_mov(s, TCG_TYPE_TL, TCG_REG_X1, l->addrlo_reg);
-    tcg_out_mov(s, TCG_TYPE_PTR, TCG_REG_X0, TCG_AREG0);
-
-    /* "Tail call" to the helper, with the return address back inline. */
-    tcg_out_adr(s, TCG_REG_LR, l->raddr);
-    tcg_out_goto_long(s, (const void *)(l->is_ld ? helper_unaligned_ld
-                                        : helper_unaligned_st));
-    return true;
-}
-
-static bool tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
-{
-    return tcg_out_fail_alignment(s, l);
-}
-
-static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
-{
-    return tcg_out_fail_alignment(s, l);
-}
-#endif /* CONFIG_SOFTMMU */
 
 /*
  * For softmmu, perform the TLB load and compare.
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH v4 23/57] tcg/ppc: Use full load/store helpers in user-only mode
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
                   ` (21 preceding siblings ...)
  2023-05-03  7:06 ` [PATCH v4 22/57] tcg/aarch64: " Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-05 12:07   ` Peter Maydell
  2023-05-03  7:06 ` [PATCH v4 24/57] tcg/loongarch64: " Richard Henderson
                   ` (34 subsequent siblings)
  57 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel; +Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

Instead of using helper_unaligned_{ld,st}, use the full load/store helpers.
This will allow the fast path to increase alignment to implement atomicity
while not immediately raising an alignment exception.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/ppc/tcg-target.c.inc | 44 ----------------------------------------
 1 file changed, 44 deletions(-)

diff --git a/tcg/ppc/tcg-target.c.inc b/tcg/ppc/tcg-target.c.inc
index 0963156a78..733f67c7a5 100644
--- a/tcg/ppc/tcg-target.c.inc
+++ b/tcg/ppc/tcg-target.c.inc
@@ -1962,7 +1962,6 @@ static const uint32_t qemu_stx_opc[(MO_SIZE + MO_BSWAP) + 1] = {
     [MO_BSWAP | MO_UQ] = STDBRX,
 };
 
-#if defined (CONFIG_SOFTMMU)
 static TCGReg ldst_ra_gen(TCGContext *s, const TCGLabelQemuLdst *l, int arg)
 {
     if (arg < 0) {
@@ -2012,49 +2011,6 @@ static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *lb)
     tcg_out_b(s, 0, lb->raddr);
     return true;
 }
-#else
-static bool tcg_out_fail_alignment(TCGContext *s, TCGLabelQemuLdst *l)
-{
-    if (!reloc_pc14(l->label_ptr[0], tcg_splitwx_to_rx(s->code_ptr))) {
-        return false;
-    }
-
-    if (TCG_TARGET_REG_BITS < TARGET_LONG_BITS) {
-        TCGReg arg = TCG_REG_R4;
-
-        arg |= (TCG_TARGET_CALL_ARG_I64 == TCG_CALL_ARG_EVEN);
-        if (l->addrlo_reg != arg) {
-            tcg_out_mov(s, TCG_TYPE_I32, arg, l->addrhi_reg);
-            tcg_out_mov(s, TCG_TYPE_I32, arg + 1, l->addrlo_reg);
-        } else if (l->addrhi_reg != arg + 1) {
-            tcg_out_mov(s, TCG_TYPE_I32, arg + 1, l->addrlo_reg);
-            tcg_out_mov(s, TCG_TYPE_I32, arg, l->addrhi_reg);
-        } else {
-            tcg_out_mov(s, TCG_TYPE_I32, TCG_REG_R0, arg);
-            tcg_out_mov(s, TCG_TYPE_I32, arg, arg + 1);
-            tcg_out_mov(s, TCG_TYPE_I32, arg + 1, TCG_REG_R0);
-        }
-    } else {
-        tcg_out_mov(s, TCG_TYPE_TL, TCG_REG_R4, l->addrlo_reg);
-    }
-    tcg_out_mov(s, TCG_TYPE_TL, TCG_REG_R3, TCG_AREG0);
-
-    /* "Tail call" to the helper, with the return address back inline. */
-    tcg_out_call_int(s, 0, (const void *)(l->is_ld ? helper_unaligned_ld
-                                          : helper_unaligned_st));
-    return true;
-}
-
-static bool tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
-{
-    return tcg_out_fail_alignment(s, l);
-}
-
-static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
-{
-    return tcg_out_fail_alignment(s, l);
-}
-#endif /* SOFTMMU */
 
 typedef struct {
     TCGReg base;
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH v4 24/57] tcg/loongarch64: Use full load/store helpers in user-only mode
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
                   ` (22 preceding siblings ...)
  2023-05-03  7:06 ` [PATCH v4 23/57] tcg/ppc: " Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-05 12:07   ` Peter Maydell
  2023-05-03  7:06 ` [PATCH v4 25/57] tcg/riscv: " Richard Henderson
                   ` (33 subsequent siblings)
  57 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel; +Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

Instead of using helper_unaligned_{ld,st}, use the full load/store helpers.
This will allow the fast path to increase alignment to implement atomicity
while not immediately raising an alignment exception.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/loongarch64/tcg-target.c.inc | 30 ------------------------------
 1 file changed, 30 deletions(-)

diff --git a/tcg/loongarch64/tcg-target.c.inc b/tcg/loongarch64/tcg-target.c.inc
index d1bc29826f..e651ec5c71 100644
--- a/tcg/loongarch64/tcg-target.c.inc
+++ b/tcg/loongarch64/tcg-target.c.inc
@@ -783,7 +783,6 @@ static bool tcg_out_sti(TCGContext *s, TCGType type, TCGArg val,
  * Load/store helpers for SoftMMU, and qemu_ld/st implementations
  */
 
-#if defined(CONFIG_SOFTMMU)
 static bool tcg_out_goto(TCGContext *s, const tcg_insn_unit *target)
 {
     tcg_out_opc_b(s, 0);
@@ -822,35 +821,6 @@ static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
     tcg_out_call_int(s, qemu_st_helpers[opc & MO_SIZE], false);
     return tcg_out_goto(s, l->raddr);
 }
-#else
-static bool tcg_out_fail_alignment(TCGContext *s, TCGLabelQemuLdst *l)
-{
-    /* resolve label address */
-    if (!reloc_br_sk16(l->label_ptr[0], tcg_splitwx_to_rx(s->code_ptr))) {
-        return false;
-    }
-
-    tcg_out_mov(s, TCG_TYPE_TL, TCG_REG_A1, l->addrlo_reg);
-    tcg_out_mov(s, TCG_TYPE_PTR, TCG_REG_A0, TCG_AREG0);
-
-    /* tail call, with the return address back inline. */
-    tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_RA, (uintptr_t)l->raddr);
-    tcg_out_call_int(s, (const void *)(l->is_ld ? helper_unaligned_ld
-                                       : helper_unaligned_st), true);
-    return true;
-}
-
-static bool tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
-{
-    return tcg_out_fail_alignment(s, l);
-}
-
-static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
-{
-    return tcg_out_fail_alignment(s, l);
-}
-
-#endif /* CONFIG_SOFTMMU */
 
 typedef struct {
     TCGReg base;
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH v4 25/57] tcg/riscv: Use full load/store helpers in user-only mode
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
                   ` (23 preceding siblings ...)
  2023-05-03  7:06 ` [PATCH v4 24/57] tcg/loongarch64: " Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-05 12:07   ` Peter Maydell
  2023-05-03  7:06 ` [PATCH v4 26/57] tcg/arm: Adjust constraints on qemu_ld/st Richard Henderson
                   ` (32 subsequent siblings)
  57 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel; +Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

Instead of using helper_unaligned_{ld,st}, use the full load/store helpers.
This will allow the fast path to increase alignment to implement atomicity
while not immediately raising an alignment exception.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/riscv/tcg-target.c.inc | 29 -----------------------------
 1 file changed, 29 deletions(-)

diff --git a/tcg/riscv/tcg-target.c.inc b/tcg/riscv/tcg-target.c.inc
index 8ed0e2f210..19cd4507fb 100644
--- a/tcg/riscv/tcg-target.c.inc
+++ b/tcg/riscv/tcg-target.c.inc
@@ -846,7 +846,6 @@ static void tcg_out_mb(TCGContext *s, TCGArg a0)
  * Load/store and TLB
  */
 
-#if defined(CONFIG_SOFTMMU)
 static void tcg_out_goto(TCGContext *s, const tcg_insn_unit *target)
 {
     tcg_out_opc_jump(s, OPC_JAL, TCG_REG_ZERO, 0);
@@ -893,34 +892,6 @@ static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
     tcg_out_goto(s, l->raddr);
     return true;
 }
-#else
-static bool tcg_out_fail_alignment(TCGContext *s, TCGLabelQemuLdst *l)
-{
-    /* resolve label address */
-    if (!reloc_sbimm12(l->label_ptr[0], tcg_splitwx_to_rx(s->code_ptr))) {
-        return false;
-    }
-
-    tcg_out_mov(s, TCG_TYPE_TL, TCG_REG_A1, l->addrlo_reg);
-    tcg_out_mov(s, TCG_TYPE_PTR, TCG_REG_A0, TCG_AREG0);
-
-    /* tail call, with the return address back inline. */
-    tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_RA, (uintptr_t)l->raddr);
-    tcg_out_call_int(s, (const void *)(l->is_ld ? helper_unaligned_ld
-                                       : helper_unaligned_st), true);
-    return true;
-}
-
-static bool tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
-{
-    return tcg_out_fail_alignment(s, l);
-}
-
-static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
-{
-    return tcg_out_fail_alignment(s, l);
-}
-#endif /* CONFIG_SOFTMMU */
 
 /*
  * For softmmu, perform the TLB load and compare.
-- 
2.34.1




* [PATCH v4 26/57] tcg/arm: Adjust constraints on qemu_ld/st
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
                   ` (24 preceding siblings ...)
  2023-05-03  7:06 ` [PATCH v4 25/57] tcg/riscv: " Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-05 12:14   ` Peter Maydell
  2023-05-03  7:06 ` [PATCH v4 27/57] tcg/arm: Use full load/store helpers in user-only mode Richard Henderson
                   ` (31 subsequent siblings)
  57 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel; +Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

Always reserve r3 for tlb softmmu lookup.  Fix a bug in user-only
ALL_QLDST_REGS, in that r14 is clobbered by the BLNE that leads
to the misaligned trap.
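
In other words, r14 must never be handed out for a qemu_ld/st operand in
either configuration: the conditional BLNE overwrites r14 with its return
address, so an address or data operand living there would already be lost
on entry to the slow path.  For user-only the mask collapses to the single
exclusion shown in the hunk below:

    #define ALL_QLDST_REGS   (ALL_GENERAL_REGS & ~(1 << TCG_REG_R14))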

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/arm/tcg-target-con-set.h | 16 ++++++++--------
 tcg/arm/tcg-target-con-str.h |  5 ++---
 tcg/arm/tcg-target.c.inc     | 23 ++++++++---------------
 3 files changed, 18 insertions(+), 26 deletions(-)

diff --git a/tcg/arm/tcg-target-con-set.h b/tcg/arm/tcg-target-con-set.h
index b8849b2478..229ae258ac 100644
--- a/tcg/arm/tcg-target-con-set.h
+++ b/tcg/arm/tcg-target-con-set.h
@@ -12,19 +12,19 @@
 C_O0_I1(r)
 C_O0_I2(r, r)
 C_O0_I2(r, rIN)
-C_O0_I2(s, s)
+C_O0_I2(q, q)
 C_O0_I2(w, r)
-C_O0_I3(s, s, s)
-C_O0_I3(S, p, s)
+C_O0_I3(q, q, q)
+C_O0_I3(Q, p, q)
 C_O0_I4(r, r, rI, rI)
-C_O0_I4(S, p, s, s)
-C_O1_I1(r, l)
+C_O0_I4(Q, p, q, q)
+C_O1_I1(r, q)
 C_O1_I1(r, r)
 C_O1_I1(w, r)
 C_O1_I1(w, w)
 C_O1_I1(w, wr)
 C_O1_I2(r, 0, rZ)
-C_O1_I2(r, l, l)
+C_O1_I2(r, q, q)
 C_O1_I2(r, r, r)
 C_O1_I2(r, r, rI)
 C_O1_I2(r, r, rIK)
@@ -39,8 +39,8 @@ C_O1_I2(w, w, wZ)
 C_O1_I3(w, w, w, w)
 C_O1_I4(r, r, r, rI, rI)
 C_O1_I4(r, r, rIN, rIK, 0)
-C_O2_I1(e, p, l)
-C_O2_I2(e, p, l, l)
+C_O2_I1(e, p, q)
+C_O2_I2(e, p, q, q)
 C_O2_I2(r, r, r, r)
 C_O2_I4(r, r, r, r, rIN, rIK)
 C_O2_I4(r, r, rI, rI, rIN, rIK)
diff --git a/tcg/arm/tcg-target-con-str.h b/tcg/arm/tcg-target-con-str.h
index 24b4b59feb..f83f1d3919 100644
--- a/tcg/arm/tcg-target-con-str.h
+++ b/tcg/arm/tcg-target-con-str.h
@@ -10,9 +10,8 @@
  */
 REGS('e', ALL_GENERAL_REGS & 0x5555) /* even regs */
 REGS('r', ALL_GENERAL_REGS)
-REGS('l', ALL_QLOAD_REGS)
-REGS('s', ALL_QSTORE_REGS)
-REGS('S', ALL_QSTORE_REGS & 0x5555)  /* even qstore */
+REGS('q', ALL_QLDST_REGS)
+REGS('Q', ALL_QLDST_REGS & 0x5555)   /* even qldst */
 REGS('w', ALL_VECTOR_REGS)
 
 /*
diff --git a/tcg/arm/tcg-target.c.inc b/tcg/arm/tcg-target.c.inc
index 8b0d526659..a02804dd69 100644
--- a/tcg/arm/tcg-target.c.inc
+++ b/tcg/arm/tcg-target.c.inc
@@ -353,23 +353,16 @@ static bool patch_reloc(tcg_insn_unit *code_ptr, int type,
 #define ALL_VECTOR_REGS   0xffff0000u
 
 /*
- * r0-r2 will be overwritten when reading the tlb entry (softmmu only)
- * and r0-r1 doing the byte swapping, so don't use these.
- * r3 is removed for softmmu to avoid clashes with helper arguments.
+ * r0-r3 will be overwritten when reading the tlb entry (softmmu only);
+ * r14 will be overwritten by the BLNE branching to the slow path.
  */
 #ifdef CONFIG_SOFTMMU
-#define ALL_QLOAD_REGS \
+#define ALL_QLDST_REGS \
     (ALL_GENERAL_REGS & ~((1 << TCG_REG_R0) | (1 << TCG_REG_R1) | \
                           (1 << TCG_REG_R2) | (1 << TCG_REG_R3) | \
                           (1 << TCG_REG_R14)))
-#define ALL_QSTORE_REGS \
-    (ALL_GENERAL_REGS & ~((1 << TCG_REG_R0) | (1 << TCG_REG_R1) | \
-                          (1 << TCG_REG_R2) | (1 << TCG_REG_R14) | \
-                          ((TARGET_LONG_BITS == 64) << TCG_REG_R3)))
 #else
-#define ALL_QLOAD_REGS   ALL_GENERAL_REGS
-#define ALL_QSTORE_REGS \
-    (ALL_GENERAL_REGS & ~((1 << TCG_REG_R0) | (1 << TCG_REG_R1)))
+#define ALL_QLDST_REGS   (ALL_GENERAL_REGS & ~(1 << TCG_REG_R14))
 #endif
 
 /*
@@ -2203,13 +2196,13 @@ static TCGConstraintSetIndex tcg_target_op_def(TCGOpcode op)
         return C_O1_I4(r, r, r, rI, rI);
 
     case INDEX_op_qemu_ld_i32:
-        return TARGET_LONG_BITS == 32 ? C_O1_I1(r, l) : C_O1_I2(r, l, l);
+        return TARGET_LONG_BITS == 32 ? C_O1_I1(r, q) : C_O1_I2(r, q, q);
     case INDEX_op_qemu_ld_i64:
-        return TARGET_LONG_BITS == 32 ? C_O2_I1(e, p, l) : C_O2_I2(e, p, l, l);
+        return TARGET_LONG_BITS == 32 ? C_O2_I1(e, p, q) : C_O2_I2(e, p, q, q);
     case INDEX_op_qemu_st_i32:
-        return TARGET_LONG_BITS == 32 ? C_O0_I2(s, s) : C_O0_I3(s, s, s);
+        return TARGET_LONG_BITS == 32 ? C_O0_I2(q, q) : C_O0_I3(q, q, q);
     case INDEX_op_qemu_st_i64:
-        return TARGET_LONG_BITS == 32 ? C_O0_I3(S, p, s) : C_O0_I4(S, p, s, s);
+        return TARGET_LONG_BITS == 32 ? C_O0_I3(Q, p, q) : C_O0_I4(Q, p, q, q);
 
     case INDEX_op_st_vec:
         return C_O0_I2(w, r);
-- 
2.34.1




* [PATCH v4 27/57] tcg/arm: Use full load/store helpers in user-only mode
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
                   ` (25 preceding siblings ...)
  2023-05-03  7:06 ` [PATCH v4 26/57] tcg/arm: Adjust constraints on qemu_ld/st Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-05 12:15   ` Peter Maydell
  2023-05-03  7:06 ` [PATCH v4 28/57] tcg/mips: " Richard Henderson
                   ` (30 subsequent siblings)
  57 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel; +Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

Instead of using helper_unaligned_{ld,st}, use the full load/store helpers.
This will allow the fast path to increase alignment to implement atomicity
while not immediately raising an alignment exception.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/arm/tcg-target.c.inc | 45 ----------------------------------------
 1 file changed, 45 deletions(-)

diff --git a/tcg/arm/tcg-target.c.inc b/tcg/arm/tcg-target.c.inc
index a02804dd69..eb0542f32e 100644
--- a/tcg/arm/tcg-target.c.inc
+++ b/tcg/arm/tcg-target.c.inc
@@ -1325,7 +1325,6 @@ typedef struct {
     bool index_scratch;
 } HostAddress;
 
-#ifdef CONFIG_SOFTMMU
 static TCGReg ldst_ra_gen(TCGContext *s, const TCGLabelQemuLdst *l, int arg)
 {
     /* We arrive at the slow path via "BLNE", so R14 contains l->raddr. */
@@ -1368,50 +1367,6 @@ static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *lb)
     tcg_out_goto(s, COND_AL, qemu_st_helpers[opc & MO_SIZE]);
     return true;
 }
-#else
-static bool tcg_out_fail_alignment(TCGContext *s, TCGLabelQemuLdst *l)
-{
-    if (!reloc_pc24(l->label_ptr[0], tcg_splitwx_to_rx(s->code_ptr))) {
-        return false;
-    }
-
-    if (TARGET_LONG_BITS == 64) {
-        /* 64-bit target address is aligned into R2:R3. */
-        TCGMovExtend ext[2] = {
-            { .dst = TCG_REG_R2, .dst_type = TCG_TYPE_I32,
-              .src = l->addrlo_reg,
-              .src_type = TCG_TYPE_I32, .src_ext = MO_UL },
-            { .dst = TCG_REG_R3, .dst_type = TCG_TYPE_I32,
-              .src = l->addrhi_reg,
-              .src_type = TCG_TYPE_I32, .src_ext = MO_UL },
-        };
-        tcg_out_movext2(s, &ext[0], &ext[1], TCG_REG_TMP);
-    } else {
-        tcg_out_mov(s, TCG_TYPE_I32, TCG_REG_R1, l->addrlo_reg);
-    }
-    tcg_out_mov(s, TCG_TYPE_PTR, TCG_REG_R0, TCG_AREG0);
-
-    /*
-     * Tail call to the helper, with the return address back inline,
-     * just for the clarity of the debugging traceback -- the helper
-     * cannot return.  We have used BLNE to arrive here, so LR is
-     * already set.
-     */
-    tcg_out_goto(s, COND_AL, (const void *)
-                 (l->is_ld ? helper_unaligned_ld : helper_unaligned_st));
-    return true;
-}
-
-static bool tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
-{
-    return tcg_out_fail_alignment(s, l);
-}
-
-static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
-{
-    return tcg_out_fail_alignment(s, l);
-}
-#endif /* SOFTMMU */
 
 static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
                                            TCGReg addrlo, TCGReg addrhi,
-- 
2.34.1




* [PATCH v4 28/57] tcg/mips: Use full load/store helpers in user-only mode
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
                   ` (26 preceding siblings ...)
  2023-05-03  7:06 ` [PATCH v4 27/57] tcg/arm: Use full load/store helpers in user-only mode Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-05 12:15   ` Peter Maydell
  2023-05-03  7:06 ` [PATCH v4 29/57] tcg/s390x: " Richard Henderson
                   ` (29 subsequent siblings)
  57 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel; +Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

Instead of using helper_unaligned_{ld,st}, use the full load/store helpers.
This will allow the fast path to increase alignment to implement atomicity
while not immediately raising an alignment exception.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/mips/tcg-target.c.inc | 57 ++-------------------------------------
 1 file changed, 2 insertions(+), 55 deletions(-)

diff --git a/tcg/mips/tcg-target.c.inc b/tcg/mips/tcg-target.c.inc
index 7770ef46bd..fa0f334e8d 100644
--- a/tcg/mips/tcg-target.c.inc
+++ b/tcg/mips/tcg-target.c.inc
@@ -1075,7 +1075,6 @@ static void tcg_out_call(TCGContext *s, const tcg_insn_unit *arg,
     tcg_out_nop(s);
 }
 
-#if defined(CONFIG_SOFTMMU)
 /* We have four temps, we might as well expose three of them. */
 static const TCGLdstHelperParam ldst_helper_param = {
     .ntmp = 3, .tmp = { TCG_TMP0, TCG_TMP1, TCG_TMP2 }
@@ -1088,8 +1087,7 @@ static bool tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
 
     /* resolve label address */
     if (!reloc_pc16(l->label_ptr[0], tgt_rx)
-        || (TCG_TARGET_REG_BITS < TARGET_LONG_BITS
-            && !reloc_pc16(l->label_ptr[1], tgt_rx))) {
+        || (l->label_ptr[1] && !reloc_pc16(l->label_ptr[1], tgt_rx))) {
         return false;
     }
 
@@ -1118,8 +1116,7 @@ static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
 
     /* resolve label address */
     if (!reloc_pc16(l->label_ptr[0], tgt_rx)
-        || (TCG_TARGET_REG_BITS < TARGET_LONG_BITS
-            && !reloc_pc16(l->label_ptr[1], tgt_rx))) {
+        || (l->label_ptr[1] && !reloc_pc16(l->label_ptr[1], tgt_rx))) {
         return false;
     }
 
@@ -1139,56 +1136,6 @@ static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
     return true;
 }
 
-#else
-static bool tcg_out_fail_alignment(TCGContext *s, TCGLabelQemuLdst *l)
-{
-    void *target;
-
-    if (!reloc_pc16(l->label_ptr[0], tcg_splitwx_to_rx(s->code_ptr))) {
-        return false;
-    }
-
-    if (TCG_TARGET_REG_BITS < TARGET_LONG_BITS) {
-        /* A0 is env, A1 is skipped, A2:A3 is the uint64_t address. */
-        TCGReg a2 = MIPS_BE ? l->addrhi_reg : l->addrlo_reg;
-        TCGReg a3 = MIPS_BE ? l->addrlo_reg : l->addrhi_reg;
-
-        if (a3 != TCG_REG_A2) {
-            tcg_out_mov(s, TCG_TYPE_I32, TCG_REG_A2, a2);
-            tcg_out_mov(s, TCG_TYPE_I32, TCG_REG_A3, a3);
-        } else if (a2 != TCG_REG_A3) {
-            tcg_out_mov(s, TCG_TYPE_I32, TCG_REG_A3, a3);
-            tcg_out_mov(s, TCG_TYPE_I32, TCG_REG_A2, a2);
-        } else {
-            tcg_out_mov(s, TCG_TYPE_I32, TCG_TMP0, TCG_REG_A2);
-            tcg_out_mov(s, TCG_TYPE_I32, TCG_REG_A2, TCG_REG_A3);
-            tcg_out_mov(s, TCG_TYPE_I32, TCG_REG_A3, TCG_TMP0);
-        }
-    } else {
-        tcg_out_mov(s, TCG_TYPE_TL, TCG_REG_A1, l->addrlo_reg);
-    }
-    tcg_out_mov(s, TCG_TYPE_PTR, TCG_REG_A0, TCG_AREG0);
-
-    /*
-     * Tail call to the helper, with the return address back inline.
-     * We have arrived here via BNEL, so $31 is already set.
-     */
-    target = (l->is_ld ? helper_unaligned_ld : helper_unaligned_st);
-    tcg_out_call_int(s, target, true);
-    return true;
-}
-
-static bool tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
-{
-    return tcg_out_fail_alignment(s, l);
-}
-
-static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
-{
-    return tcg_out_fail_alignment(s, l);
-}
-#endif /* SOFTMMU */
-
 typedef struct {
     TCGReg base;
     MemOp align;
-- 
2.34.1




* [PATCH v4 29/57] tcg/s390x: Use full load/store helpers in user-only mode
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
                   ` (27 preceding siblings ...)
  2023-05-03  7:06 ` [PATCH v4 28/57] tcg/mips: " Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-05 12:16   ` Peter Maydell
  2023-05-03  7:06 ` [PATCH v4 30/57] tcg/sparc64: Allocate %g2 as a third temporary Richard Henderson
                   ` (28 subsequent siblings)
  57 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel; +Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

Instead of using helper_unaligned_{ld,st}, use the full load/store helpers.
This will allow the fast path to increase alignment to implement atomicity
while not immediately raising an alignment exception.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/s390x/tcg-target.c.inc | 29 -----------------------------
 1 file changed, 29 deletions(-)

diff --git a/tcg/s390x/tcg-target.c.inc b/tcg/s390x/tcg-target.c.inc
index 968977be98..de8aed5f77 100644
--- a/tcg/s390x/tcg-target.c.inc
+++ b/tcg/s390x/tcg-target.c.inc
@@ -1679,7 +1679,6 @@ static void tcg_out_qemu_st_direct(TCGContext *s, MemOp opc, TCGReg data,
     }
 }
 
-#if defined(CONFIG_SOFTMMU)
 static const TCGLdstHelperParam ldst_helper_param = {
     .ntmp = 1, .tmp = { TCG_TMP0 }
 };
@@ -1716,34 +1715,6 @@ static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *lb)
     tgen_gotoi(s, S390_CC_ALWAYS, lb->raddr);
     return true;
 }
-#else
-static bool tcg_out_fail_alignment(TCGContext *s, TCGLabelQemuLdst *l)
-{
-    if (!patch_reloc(l->label_ptr[0], R_390_PC16DBL,
-                     (intptr_t)tcg_splitwx_to_rx(s->code_ptr), 2)) {
-        return false;
-    }
-
-    tcg_out_mov(s, TCG_TYPE_TL, TCG_REG_R3, l->addrlo_reg);
-    tcg_out_mov(s, TCG_TYPE_PTR, TCG_REG_R2, TCG_AREG0);
-
-    /* "Tail call" to the helper, with the return address back inline. */
-    tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_R14, (uintptr_t)l->raddr);
-    tgen_gotoi(s, S390_CC_ALWAYS, (const void *)(l->is_ld ? helper_unaligned_ld
-                                                 : helper_unaligned_st));
-    return true;
-}
-
-static bool tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
-{
-    return tcg_out_fail_alignment(s, l);
-}
-
-static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
-{
-    return tcg_out_fail_alignment(s, l);
-}
-#endif /* CONFIG_SOFTMMU */
 
 /*
  * For softmmu, perform the TLB load and compare.
-- 
2.34.1




* [PATCH v4 30/57] tcg/sparc64: Allocate %g2 as a third temporary
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
                   ` (28 preceding siblings ...)
  2023-05-03  7:06 ` [PATCH v4 29/57] tcg/s390x: " Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-05 12:19   ` Peter Maydell
  2023-05-03  7:06 ` [PATCH v4 31/57] tcg/sparc64: Rename tcg_out_movi_imm13 to tcg_out_movi_s13 Richard Henderson
                   ` (27 subsequent siblings)
  57 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel; +Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/sparc64/tcg-target.c.inc | 15 +++++++--------
 1 file changed, 7 insertions(+), 8 deletions(-)

diff --git a/tcg/sparc64/tcg-target.c.inc b/tcg/sparc64/tcg-target.c.inc
index e997db2645..64464ab363 100644
--- a/tcg/sparc64/tcg-target.c.inc
+++ b/tcg/sparc64/tcg-target.c.inc
@@ -83,9 +83,10 @@ static const char * const tcg_target_reg_names[TCG_TARGET_NB_REGS] = {
 #define ALL_GENERAL_REGS     MAKE_64BIT_MASK(0, 32)
 #define ALL_QLDST_REGS       (ALL_GENERAL_REGS & ~SOFTMMU_RESERVE_REGS)
 
-/* Define some temporary registers.  T2 is used for constant generation.  */
+/* Define some temporary registers.  T3 is used for constant generation.  */
 #define TCG_REG_T1  TCG_REG_G1
-#define TCG_REG_T2  TCG_REG_O7
+#define TCG_REG_T2  TCG_REG_G2
+#define TCG_REG_T3  TCG_REG_O7
 
 #ifndef CONFIG_SOFTMMU
 # define TCG_GUEST_BASE_REG TCG_REG_I5
@@ -110,7 +111,6 @@ static const int tcg_target_reg_alloc_order[] = {
     TCG_REG_I4,
     TCG_REG_I5,
 
-    TCG_REG_G2,
     TCG_REG_G3,
     TCG_REG_G4,
     TCG_REG_G5,
@@ -492,8 +492,8 @@ static void tcg_out_movi_int(TCGContext *s, TCGType type, TCGReg ret,
 static void tcg_out_movi(TCGContext *s, TCGType type,
                          TCGReg ret, tcg_target_long arg)
 {
-    tcg_debug_assert(ret != TCG_REG_T2);
-    tcg_out_movi_int(s, type, ret, arg, false, TCG_REG_T2);
+    tcg_debug_assert(ret != TCG_REG_T3);
+    tcg_out_movi_int(s, type, ret, arg, false, TCG_REG_T3);
 }
 
 static void tcg_out_ext8s(TCGContext *s, TCGType type, TCGReg rd, TCGReg rs)
@@ -885,10 +885,8 @@ static void tcg_out_jmpl_const(TCGContext *s, const tcg_insn_unit *dest,
 {
     uintptr_t desti = (uintptr_t)dest;
 
-    /* Be careful not to clobber %o7 for a tail call. */
     tcg_out_movi_int(s, TCG_TYPE_PTR, TCG_REG_T1,
-                     desti & ~0xfff, in_prologue,
-                     tail_call ? TCG_REG_G2 : TCG_REG_O7);
+                     desti & ~0xfff, in_prologue, TCG_REG_T2);
     tcg_out_arithi(s, tail_call ? TCG_REG_G0 : TCG_REG_O7,
                    TCG_REG_T1, desti & 0xfff, JMPL);
 }
@@ -1856,6 +1854,7 @@ static void tcg_target_init(TCGContext *s)
     tcg_regset_set_reg(s->reserved_regs, TCG_REG_O6); /* stack pointer */
     tcg_regset_set_reg(s->reserved_regs, TCG_REG_T1); /* for internal use */
     tcg_regset_set_reg(s->reserved_regs, TCG_REG_T2); /* for internal use */
+    tcg_regset_set_reg(s->reserved_regs, TCG_REG_T3); /* for internal use */
 }
 
 #define ELF_HOST_MACHINE  EM_SPARCV9
-- 
2.34.1




* [PATCH v4 31/57] tcg/sparc64: Rename tcg_out_movi_imm13 to tcg_out_movi_s13
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
                   ` (29 preceding siblings ...)
  2023-05-03  7:06 ` [PATCH v4 30/57] tcg/sparc64: Allocate %g2 as a third temporary Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-05 12:20   ` Peter Maydell
  2023-05-03  7:06 ` [PATCH v4 32/57] tcg/sparc64: Rename tcg_out_movi_imm32 to tcg_out_movi_u32 Richard Henderson
                   ` (26 subsequent siblings)
  57 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel; +Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

Emphasize that the constant is signed.
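
A simm13 operand is sign-extended by the hardware, so the renamed function
covers exactly the range [-4096, 4095].  A quick sketch of the fit test
(an illustration, not code from this patch):

    /* Illustrative only: the values "or %g0, simm13, ret" can
     * materialize in a single instruction.
     */
    static inline bool fits_simm13(int64_t arg)
    {
        return arg >= -0x1000 && arg <= 0xfff;
    }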

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/sparc64/tcg-target.c.inc | 30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/tcg/sparc64/tcg-target.c.inc b/tcg/sparc64/tcg-target.c.inc
index 64464ab363..2e6127d506 100644
--- a/tcg/sparc64/tcg-target.c.inc
+++ b/tcg/sparc64/tcg-target.c.inc
@@ -399,7 +399,7 @@ static void tcg_out_sethi(TCGContext *s, TCGReg ret, uint32_t arg)
     tcg_out32(s, SETHI | INSN_RD(ret) | ((arg & 0xfffffc00) >> 10));
 }
 
-static void tcg_out_movi_imm13(TCGContext *s, TCGReg ret, int32_t arg)
+static void tcg_out_movi_s13(TCGContext *s, TCGReg ret, int32_t arg)
 {
     tcg_out_arithi(s, ret, TCG_REG_G0, arg, ARITH_OR);
 }
@@ -408,7 +408,7 @@ static void tcg_out_movi_imm32(TCGContext *s, TCGReg ret, int32_t arg)
 {
     if (check_fit_i32(arg, 13)) {
         /* A 13-bit constant sign-extended to 64-bits.  */
-        tcg_out_movi_imm13(s, ret, arg);
+        tcg_out_movi_s13(s, ret, arg);
     } else {
         /* A 32-bit constant zero-extended to 64 bits.  */
         tcg_out_sethi(s, ret, arg);
@@ -425,15 +425,15 @@ static void tcg_out_movi_int(TCGContext *s, TCGType type, TCGReg ret,
     tcg_target_long hi, lo = (int32_t)arg;
     tcg_target_long test, lsb;
 
-    /* A 32-bit constant, or 32-bit zero-extended to 64-bits.  */
-    if (type == TCG_TYPE_I32 || arg == (uint32_t)arg) {
-        tcg_out_movi_imm32(s, ret, arg);
+    /* A 13-bit constant sign-extended to 64-bits.  */
+    if (check_fit_tl(arg, 13)) {
+        tcg_out_movi_s13(s, ret, arg);
         return;
     }
 
-    /* A 13-bit constant sign-extended to 64-bits.  */
-    if (check_fit_tl(arg, 13)) {
-        tcg_out_movi_imm13(s, ret, arg);
+    /* A 32-bit constant, or 32-bit zero-extended to 64-bits.  */
+    if (type == TCG_TYPE_I32 || arg == (uint32_t)arg) {
+        tcg_out_movi_imm32(s, ret, arg);
         return;
     }
 
@@ -767,7 +767,7 @@ static void tcg_out_setcond_i32(TCGContext *s, TCGCond cond, TCGReg ret,
 
     default:
         tcg_out_cmp(s, c1, c2, c2const);
-        tcg_out_movi_imm13(s, ret, 0);
+        tcg_out_movi_s13(s, ret, 0);
         tcg_out_movcc(s, cond, MOVCC_ICC, ret, 1, 1);
         return;
     }
@@ -803,11 +803,11 @@ static void tcg_out_setcond_i64(TCGContext *s, TCGCond cond, TCGReg ret,
     /* For 64-bit signed comparisons vs zero, we can avoid the compare
        if the input does not overlap the output.  */
     if (c2 == 0 && !is_unsigned_cond(cond) && c1 != ret) {
-        tcg_out_movi_imm13(s, ret, 0);
+        tcg_out_movi_s13(s, ret, 0);
         tcg_out_movr(s, cond, ret, c1, 1, 1);
     } else {
         tcg_out_cmp(s, c1, c2, c2const);
-        tcg_out_movi_imm13(s, ret, 0);
+        tcg_out_movi_s13(s, ret, 0);
         tcg_out_movcc(s, cond, MOVCC_XCC, ret, 1, 1);
     }
 }
@@ -844,7 +844,7 @@ static void tcg_out_addsub2_i64(TCGContext *s, TCGReg rl, TCGReg rh,
     if (use_vis3_instructions && !is_sub) {
         /* Note that ADDXC doesn't accept immediates.  */
         if (bhconst && bh != 0) {
-           tcg_out_movi_imm13(s, TCG_REG_T2, bh);
+           tcg_out_movi_s13(s, TCG_REG_T2, bh);
            bh = TCG_REG_T2;
         }
         tcg_out_arith(s, rh, ah, bh, ARITH_ADDXC);
@@ -866,7 +866,7 @@ static void tcg_out_addsub2_i64(TCGContext *s, TCGReg rl, TCGReg rh,
          * so the adjustment fits 12 bits.
          */
         if (bhconst) {
-            tcg_out_movi_imm13(s, TCG_REG_T2, bh + (is_sub ? -1 : 1));
+            tcg_out_movi_s13(s, TCG_REG_T2, bh + (is_sub ? -1 : 1));
         } else {
             tcg_out_arithi(s, TCG_REG_T2, bh, 1,
                            is_sub ? ARITH_SUB : ARITH_ADD);
@@ -1036,7 +1036,7 @@ static void tcg_target_qemu_prologue(TCGContext *s)
     tcg_code_gen_epilogue = tcg_splitwx_to_rx(s->code_ptr);
     tcg_out_arithi(s, TCG_REG_G0, TCG_REG_I7, 8, RETURN);
     /* delay slot */
-    tcg_out_movi_imm13(s, TCG_REG_O0, 0);
+    tcg_out_movi_s13(s, TCG_REG_O0, 0);
 
     build_trampolines(s);
 }
@@ -1430,7 +1430,7 @@ static void tcg_out_exit_tb(TCGContext *s, uintptr_t a0)
 {
     if (check_fit_ptr(a0, 13)) {
         tcg_out_arithi(s, TCG_REG_G0, TCG_REG_I7, 8, RETURN);
-        tcg_out_movi_imm13(s, TCG_REG_O0, a0);
+        tcg_out_movi_s13(s, TCG_REG_O0, a0);
         return;
     } else {
         intptr_t tb_diff = tcg_tbrel_diff(s, (void *)a0);
-- 
2.34.1




* [PATCH v4 32/57] tcg/sparc64: Rename tcg_out_movi_imm32 to tcg_out_movi_u32
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
                   ` (30 preceding siblings ...)
  2023-05-03  7:06 ` [PATCH v4 31/57] tcg/sparc64: Rename tcg_out_movi_imm13 to tcg_out_movi_s13 Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-05 12:22   ` Peter Maydell
  2023-05-03  7:06 ` [PATCH v4 33/57] tcg/sparc64: Split out tcg_out_movi_s32 Richard Henderson
                   ` (25 subsequent siblings)
  57 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel; +Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

Emphasize that the constant is unsigned.
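
For contrast with the s13 form, the u32 form builds any 32-bit value,
zero-extended to 64 bits, in at most two instructions (a sketch of the
sequence the function emits, not new code):

    /*
     * sethi  %hi(arg), ret          -- install bits 10..31, clear the rest
     * or     ret, arg & 0x3ff, ret  -- add the low ten bits, if nonzero
     */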

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/sparc64/tcg-target.c.inc | 24 ++++++++++--------------
 1 file changed, 10 insertions(+), 14 deletions(-)

diff --git a/tcg/sparc64/tcg-target.c.inc b/tcg/sparc64/tcg-target.c.inc
index 2e6127d506..e244209890 100644
--- a/tcg/sparc64/tcg-target.c.inc
+++ b/tcg/sparc64/tcg-target.c.inc
@@ -399,22 +399,18 @@ static void tcg_out_sethi(TCGContext *s, TCGReg ret, uint32_t arg)
     tcg_out32(s, SETHI | INSN_RD(ret) | ((arg & 0xfffffc00) >> 10));
 }
 
+/* A 13-bit constant sign-extended to 64 bits.  */
 static void tcg_out_movi_s13(TCGContext *s, TCGReg ret, int32_t arg)
 {
     tcg_out_arithi(s, ret, TCG_REG_G0, arg, ARITH_OR);
 }
 
-static void tcg_out_movi_imm32(TCGContext *s, TCGReg ret, int32_t arg)
+/* A 32-bit constant zero-extended to 64 bits.  */
+static void tcg_out_movi_u32(TCGContext *s, TCGReg ret, uint32_t arg)
 {
-    if (check_fit_i32(arg, 13)) {
-        /* A 13-bit constant sign-extended to 64-bits.  */
-        tcg_out_movi_s13(s, ret, arg);
-    } else {
-        /* A 32-bit constant zero-extended to 64 bits.  */
-        tcg_out_sethi(s, ret, arg);
-        if (arg & 0x3ff) {
-            tcg_out_arithi(s, ret, ret, arg & 0x3ff, ARITH_OR);
-        }
+    tcg_out_sethi(s, ret, arg);
+    if (arg & 0x3ff) {
+        tcg_out_arithi(s, ret, ret, arg & 0x3ff, ARITH_OR);
     }
 }
 
@@ -433,7 +429,7 @@ static void tcg_out_movi_int(TCGContext *s, TCGType type, TCGReg ret,
 
     /* A 32-bit constant, or 32-bit zero-extended to 64-bits.  */
     if (type == TCG_TYPE_I32 || arg == (uint32_t)arg) {
-        tcg_out_movi_imm32(s, ret, arg);
+        tcg_out_movi_u32(s, ret, arg);
         return;
     }
 
@@ -477,13 +473,13 @@ static void tcg_out_movi_int(TCGContext *s, TCGType type, TCGReg ret,
     /* A 64-bit constant decomposed into 2 32-bit pieces.  */
     if (check_fit_i32(lo, 13)) {
         hi = (arg - lo) >> 32;
-        tcg_out_movi_imm32(s, ret, hi);
+        tcg_out_movi_u32(s, ret, hi);
         tcg_out_arithi(s, ret, ret, 32, SHIFT_SLLX);
         tcg_out_arithi(s, ret, ret, lo, ARITH_ADD);
     } else {
         hi = arg >> 32;
-        tcg_out_movi_imm32(s, ret, hi);
-        tcg_out_movi_imm32(s, scratch, lo);
+        tcg_out_movi_u32(s, ret, hi);
+        tcg_out_movi_u32(s, scratch, lo);
         tcg_out_arithi(s, ret, ret, 32, SHIFT_SLLX);
         tcg_out_arith(s, ret, ret, scratch, ARITH_OR);
     }
-- 
2.34.1




* [PATCH v4 33/57] tcg/sparc64: Split out tcg_out_movi_s32
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
                   ` (31 preceding siblings ...)
  2023-05-03  7:06 ` [PATCH v4 32/57] tcg/sparc64: Rename tcg_out_movi_imm32 to tcg_out_movi_u32 Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-05 12:23   ` Peter Maydell
  2023-05-03  7:06 ` [PATCH v4 34/57] tcg/sparc64: Use standard slow path for softmmu Richard Henderson
                   ` (24 subsequent siblings)
  57 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel; +Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
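The pair of instructions being split out is worth a worked example (an
illustration, not part of the patch): SETHI deposits bits 10..31 of ~arg
with zeros elsewhere, and the XOR with the sign-extended 13-bit immediate
(arg & 0x3ff) | -0x400 flips bits 10..63 back while supplying the low ten
bits directly.  For arg = (int32_t)0xffff0000:

    /*
     * sethi %hi(~arg), ret     ->  ret = 0x000000000000fc00
     * xor   ret, -0x400, ret   ->  ret = 0xffffffffffff0000
     *
     * which is exactly 0xffff0000 sign-extended to 64 bits.
     */
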
 tcg/sparc64/tcg-target.c.inc | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/tcg/sparc64/tcg-target.c.inc b/tcg/sparc64/tcg-target.c.inc
index e244209890..4375a06377 100644
--- a/tcg/sparc64/tcg-target.c.inc
+++ b/tcg/sparc64/tcg-target.c.inc
@@ -405,6 +405,13 @@ static void tcg_out_movi_s13(TCGContext *s, TCGReg ret, int32_t arg)
     tcg_out_arithi(s, ret, TCG_REG_G0, arg, ARITH_OR);
 }
 
+/* A 32-bit constant sign-extended to 64 bits.  */
+static void tcg_out_movi_s32(TCGContext *s, TCGReg ret, int32_t arg)
+{
+    tcg_out_sethi(s, ret, ~arg);
+    tcg_out_arithi(s, ret, ret, (arg & 0x3ff) | -0x400, ARITH_XOR);
+}
+
 /* A 32-bit constant zero-extended to 64 bits.  */
 static void tcg_out_movi_u32(TCGContext *s, TCGReg ret, uint32_t arg)
 {
@@ -444,8 +451,7 @@ static void tcg_out_movi_int(TCGContext *s, TCGType type, TCGReg ret,
 
     /* A 32-bit constant sign-extended to 64-bits.  */
     if (arg == lo) {
-        tcg_out_sethi(s, ret, ~arg);
-        tcg_out_arithi(s, ret, ret, (arg & 0x3ff) | -0x400, ARITH_XOR);
+        tcg_out_movi_s32(s, ret, arg);
         return;
     }
 
-- 
2.34.1




* [PATCH v4 34/57] tcg/sparc64: Use standard slow path for softmmu
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
                   ` (32 preceding siblings ...)
  2023-05-03  7:06 ` [PATCH v4 33/57] tcg/sparc64: Split out tcg_out_movi_s32 Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-05 12:26   ` Peter Maydell
  2023-05-03  7:06 ` [PATCH v4 35/57] accel/tcg: Remove helper_unaligned_{ld,st} Richard Henderson
                   ` (23 subsequent siblings)
  57 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel; +Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

Drop the target-specific trampolines for the standard slow path.
This lets us use tcg_out_{ld,st}_helper_args, and handles the new
atomicity bits within MemOp.

At the same time, use the full load/store helpers for user-only mode.
Drop inline unaligned access support for user-only mode, as it does
not handle atomicity.

Use TCG_REG_T[1-3] in the tlb lookup, instead of TCG_REG_O[0-2].
This allows the constraints to be simplified.
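
The rewritten qemu_ld/st paths then share one shape (a sketch of the
structure only; the real code is in the hunks below):

    /* Sketch: common structure of tcg_out_qemu_ld/st after this patch. */
    ldst = prepare_host_addr(s, &h, addr, oi, is_ld);  /* tlb load+compare,
                                                          or alignment test */
    tcg_out_ldst_rr(s, data, h.base, h.index, opc);    /* one-insn fast path */
    if (ldst) {                                        /* slow path required */
        ldst->type = data_type;
        ldst->datalo_reg = data;
        ldst->raddr = tcg_splitwx_to_rx(s->code_ptr);
    }

with the slow path later resolved by tcg_out_qemu_{ld,st}_slow_path(),
which marshals arguments via tcg_out_{ld,st}_helper_args() and calls the
generic helper_{ld,st}*_mmu routines.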

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/sparc64/tcg-target-con-set.h |   2 -
 tcg/sparc64/tcg-target-con-str.h |   1 -
 tcg/sparc64/tcg-target.h         |   1 +
 tcg/sparc64/tcg-target.c.inc     | 610 +++++++++----------------------
 4 files changed, 182 insertions(+), 432 deletions(-)

diff --git a/tcg/sparc64/tcg-target-con-set.h b/tcg/sparc64/tcg-target-con-set.h
index 31e6fea1fc..434bf25072 100644
--- a/tcg/sparc64/tcg-target-con-set.h
+++ b/tcg/sparc64/tcg-target-con-set.h
@@ -12,8 +12,6 @@
 C_O0_I1(r)
 C_O0_I2(rZ, r)
 C_O0_I2(rZ, rJ)
-C_O0_I2(sZ, s)
-C_O1_I1(r, s)
 C_O1_I1(r, r)
 C_O1_I2(r, r, r)
 C_O1_I2(r, rZ, rJ)
diff --git a/tcg/sparc64/tcg-target-con-str.h b/tcg/sparc64/tcg-target-con-str.h
index 8f5c7aef97..0577ec4942 100644
--- a/tcg/sparc64/tcg-target-con-str.h
+++ b/tcg/sparc64/tcg-target-con-str.h
@@ -9,7 +9,6 @@
  * REGS(letter, register_mask)
  */
 REGS('r', ALL_GENERAL_REGS)
-REGS('s', ALL_QLDST_REGS)
 
 /*
  * Define constraint letters for constants:
diff --git a/tcg/sparc64/tcg-target.h b/tcg/sparc64/tcg-target.h
index ffe22b1d21..7434cc99d4 100644
--- a/tcg/sparc64/tcg-target.h
+++ b/tcg/sparc64/tcg-target.h
@@ -155,6 +155,7 @@ extern bool use_vis3_instructions;
 
 #define TCG_TARGET_DEFAULT_MO (0)
 #define TCG_TARGET_HAS_MEMORY_BSWAP     1
+#define TCG_TARGET_NEED_LDST_LABELS
 #define TCG_TARGET_NEED_POOL_LABELS
 
 #endif
diff --git a/tcg/sparc64/tcg-target.c.inc b/tcg/sparc64/tcg-target.c.inc
index 4375a06377..0237188d65 100644
--- a/tcg/sparc64/tcg-target.c.inc
+++ b/tcg/sparc64/tcg-target.c.inc
@@ -27,6 +27,7 @@
 #error "unsupported code generation mode"
 #endif
 
+#include "../tcg-ldst.c.inc"
 #include "../tcg-pool.c.inc"
 
 #ifdef CONFIG_DEBUG_TCG
@@ -70,18 +71,7 @@ static const char * const tcg_target_reg_names[TCG_TARGET_NB_REGS] = {
 #define TCG_CT_CONST_S13  0x200
 #define TCG_CT_CONST_ZERO 0x400
 
-/*
- * For softmmu, we need to avoid conflicts with the first 3
- * argument registers to perform the tlb lookup, and to call
- * the helper function.
- */
-#ifdef CONFIG_SOFTMMU
-#define SOFTMMU_RESERVE_REGS MAKE_64BIT_MASK(TCG_REG_O0, 3)
-#else
-#define SOFTMMU_RESERVE_REGS 0
-#endif
-#define ALL_GENERAL_REGS     MAKE_64BIT_MASK(0, 32)
-#define ALL_QLDST_REGS       (ALL_GENERAL_REGS & ~SOFTMMU_RESERVE_REGS)
+#define ALL_GENERAL_REGS  MAKE_64BIT_MASK(0, 32)
 
 /* Define some temporary registers.  T3 is used for constant generation.  */
 #define TCG_REG_T1  TCG_REG_G1
@@ -918,82 +908,6 @@ static void tcg_out_mb(TCGContext *s, TCGArg a0)
     tcg_out32(s, MEMBAR | (a0 & TCG_MO_ALL));
 }
 
-#ifdef CONFIG_SOFTMMU
-static const tcg_insn_unit *qemu_ld_trampoline[MO_SSIZE + 1];
-static const tcg_insn_unit *qemu_st_trampoline[MO_SIZE + 1];
-
-static void build_trampolines(TCGContext *s)
-{
-    int i;
-
-    for (i = 0; i < ARRAY_SIZE(qemu_ld_helpers); ++i) {
-        if (qemu_ld_helpers[i] == NULL) {
-            continue;
-        }
-
-        /* May as well align the trampoline.  */
-        while ((uintptr_t)s->code_ptr & 15) {
-            tcg_out_nop(s);
-        }
-        qemu_ld_trampoline[i] = tcg_splitwx_to_rx(s->code_ptr);
-
-        /* Set the retaddr operand.  */
-        tcg_out_mov(s, TCG_TYPE_PTR, TCG_REG_O3, TCG_REG_O7);
-        /* Tail call.  */
-        tcg_out_jmpl_const(s, qemu_ld_helpers[i], true, true);
-        /* delay slot -- set the env argument */
-        tcg_out_mov_delay(s, TCG_REG_O0, TCG_AREG0);
-    }
-
-    for (i = 0; i < ARRAY_SIZE(qemu_st_helpers); ++i) {
-        if (qemu_st_helpers[i] == NULL) {
-            continue;
-        }
-
-        /* May as well align the trampoline.  */
-        while ((uintptr_t)s->code_ptr & 15) {
-            tcg_out_nop(s);
-        }
-        qemu_st_trampoline[i] = tcg_splitwx_to_rx(s->code_ptr);
-
-        /* Set the retaddr operand.  */
-        tcg_out_mov(s, TCG_TYPE_PTR, TCG_REG_O4, TCG_REG_O7);
-
-        /* Tail call.  */
-        tcg_out_jmpl_const(s, qemu_st_helpers[i], true, true);
-        /* delay slot -- set the env argument */
-        tcg_out_mov_delay(s, TCG_REG_O0, TCG_AREG0);
-    }
-}
-#else
-static const tcg_insn_unit *qemu_unalign_ld_trampoline;
-static const tcg_insn_unit *qemu_unalign_st_trampoline;
-
-static void build_trampolines(TCGContext *s)
-{
-    for (int ld = 0; ld < 2; ++ld) {
-        void *helper;
-
-        while ((uintptr_t)s->code_ptr & 15) {
-            tcg_out_nop(s);
-        }
-
-        if (ld) {
-            helper = helper_unaligned_ld;
-            qemu_unalign_ld_trampoline = tcg_splitwx_to_rx(s->code_ptr);
-        } else {
-            helper = helper_unaligned_st;
-            qemu_unalign_st_trampoline = tcg_splitwx_to_rx(s->code_ptr);
-        }
-
-        /* Tail call.  */
-        tcg_out_jmpl_const(s, helper, true, true);
-        /* delay slot -- set the env argument */
-        tcg_out_mov_delay(s, TCG_REG_O0, TCG_AREG0);
-    }
-}
-#endif
-
 /* Generate global QEMU prologue and epilogue code */
 static void tcg_target_qemu_prologue(TCGContext *s)
 {
@@ -1039,8 +953,6 @@ static void tcg_target_qemu_prologue(TCGContext *s)
     tcg_out_arithi(s, TCG_REG_G0, TCG_REG_I7, 8, RETURN);
     /* delay slot */
     tcg_out_movi_s13(s, TCG_REG_O0, 0);
-
-    build_trampolines(s);
 }
 
 static void tcg_out_nop_fill(tcg_insn_unit *p, int count)
@@ -1051,381 +963,224 @@ static void tcg_out_nop_fill(tcg_insn_unit *p, int count)
     }
 }
 
-#if defined(CONFIG_SOFTMMU)
+static const TCGLdstHelperParam ldst_helper_param = {
+    .ntmp = 1, .tmp = { TCG_REG_T1 }
+};
 
-/* We expect to use a 13-bit negative offset from ENV.  */
-QEMU_BUILD_BUG_ON(TLB_MASK_TABLE_OFS(0) > 0);
-QEMU_BUILD_BUG_ON(TLB_MASK_TABLE_OFS(0) < -(1 << 12));
-
-/* Perform the TLB load and compare.
-
-   Inputs:
-   ADDRLO and ADDRHI contain the possible two parts of the address.
-
-   MEM_INDEX and S_BITS are the memory context and log2 size of the load.
-
-   WHICH is the offset into the CPUTLBEntry structure of the slot to read.
-   This should be offsetof addr_read or addr_write.
-
-   The result of the TLB comparison is in %[ix]cc.  The sanitized address
-   is in the returned register, maybe %o0.  The TLB addend is in %o1.  */
-
-static TCGReg tcg_out_tlb_load(TCGContext *s, TCGReg addr, int mem_index,
-                               MemOp opc, int which)
+static bool tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *lb)
 {
+    MemOp opc = get_memop(lb->oi);
+    MemOp sgn;
+
+    if (!patch_reloc(lb->label_ptr[0], R_SPARC_WDISP19,
+                     (intptr_t)tcg_splitwx_to_rx(s->code_ptr), 0)) {
+        return false;
+    }
+
+    /* Use inline tcg_out_ext32s; otherwise let the helper sign-extend. */
+    sgn = (opc & MO_SIZE) < MO_32 ? MO_SIGN : 0;
+
+    tcg_out_ld_helper_args(s, lb, &ldst_helper_param);
+    tcg_out_call(s, qemu_ld_helpers[opc & (MO_SIZE | sgn)], NULL);
+    tcg_out_ld_helper_ret(s, lb, sgn, &ldst_helper_param);
+
+    tcg_out_bpcc0(s, COND_A, BPCC_A | BPCC_PT, 0);
+    return patch_reloc(s->code_ptr - 1, R_SPARC_WDISP19,
+                       (intptr_t)lb->raddr, 0);
+}
+
+static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *lb)
+{
+    MemOp opc = get_memop(lb->oi);
+
+    if (!patch_reloc(lb->label_ptr[0], R_SPARC_WDISP19,
+                     (intptr_t)tcg_splitwx_to_rx(s->code_ptr), 0)) {
+        return false;
+    }
+
+    tcg_out_st_helper_args(s, lb, &ldst_helper_param);
+    tcg_out_call(s, qemu_st_helpers[opc & MO_SIZE], NULL);
+
+    tcg_out_bpcc0(s, COND_A, BPCC_A | BPCC_PT, 0);
+    return patch_reloc(s->code_ptr - 1, R_SPARC_WDISP19,
+                       (intptr_t)lb->raddr, 0);
+}
+
+typedef struct {
+    TCGReg base;
+    TCGReg index;
+} HostAddress;
+
+/*
+ * For softmmu, perform the TLB load and compare.
+ * For useronly, perform any required alignment tests.
+ * In both cases, return a TCGLabelQemuLdst structure if the slow path
+ * is required and fill in @h with the host address for the fast path.
+ */
+static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
+                                           TCGReg addr_reg, MemOpIdx oi,
+                                           bool is_ld)
+{
+    TCGLabelQemuLdst *ldst = NULL;
+    MemOp opc = get_memop(oi);
+    unsigned a_bits = get_alignment_bits(opc);
+    unsigned s_bits = opc & MO_SIZE;
+    unsigned a_mask;
+
+    /* We don't support unaligned accesses. */
+    a_bits = MAX(a_bits, s_bits);
+    a_mask = (1u << a_bits) - 1;
+
+#ifdef CONFIG_SOFTMMU
+    int mem_index = get_mmuidx(oi);
     int fast_off = TLB_MASK_TABLE_OFS(mem_index);
     int mask_off = fast_off + offsetof(CPUTLBDescFast, mask);
     int table_off = fast_off + offsetof(CPUTLBDescFast, table);
-    const TCGReg r0 = TCG_REG_O0;
-    const TCGReg r1 = TCG_REG_O1;
-    const TCGReg r2 = TCG_REG_O2;
-    unsigned s_bits = opc & MO_SIZE;
-    unsigned a_bits = get_alignment_bits(opc);
-    tcg_target_long compare_mask;
+    int cmp_off = is_ld ? offsetof(CPUTLBEntry, addr_read)
+                        : offsetof(CPUTLBEntry, addr_write);
+    int add_off = offsetof(CPUTLBEntry, addend);
+    int compare_mask;
+    int cc;
 
     /* Load tlb_mask[mmu_idx] and tlb_table[mmu_idx].  */
-    tcg_out_ld(s, TCG_TYPE_PTR, r0, TCG_AREG0, mask_off);
-    tcg_out_ld(s, TCG_TYPE_PTR, r1, TCG_AREG0, table_off);
+    QEMU_BUILD_BUG_ON(TLB_MASK_TABLE_OFS(0) > 0);
+    QEMU_BUILD_BUG_ON(TLB_MASK_TABLE_OFS(0) < -(1 << 12));
+    tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_T2, TCG_AREG0, mask_off);
+    tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_T3, TCG_AREG0, table_off);
 
     /* Extract the page index, shifted into place for tlb index.  */
-    tcg_out_arithi(s, r2, addr, TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS,
-                   SHIFT_SRL);
-    tcg_out_arith(s, r2, r2, r0, ARITH_AND);
+    tcg_out_arithi(s, TCG_REG_T1, addr_reg,
+                   TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS, SHIFT_SRL);
+    tcg_out_arith(s, TCG_REG_T1, TCG_REG_T1, TCG_REG_T2, ARITH_AND);
 
     /* Add the tlb_table pointer, creating the CPUTLBEntry address into R2.  */
-    tcg_out_arith(s, r2, r2, r1, ARITH_ADD);
+    tcg_out_arith(s, TCG_REG_T1, TCG_REG_T1, TCG_REG_T3, ARITH_ADD);
 
-    /* Load the tlb comparator and the addend.  */
-    tcg_out_ld(s, TCG_TYPE_TL, r0, r2, which);
-    tcg_out_ld(s, TCG_TYPE_PTR, r1, r2, offsetof(CPUTLBEntry, addend));
+    /* Load the tlb comparator and the addend. */
+    tcg_out_ld(s, TCG_TYPE_TL, TCG_REG_T2, TCG_REG_T1, cmp_off);
+    tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_T1, TCG_REG_T1, add_off);
+    h->base = TCG_REG_T1;
 
-    /* Mask out the page offset, except for the required alignment.
-       We don't support unaligned accesses.  */
-    if (a_bits < s_bits) {
-        a_bits = s_bits;
-    }
-    compare_mask = (tcg_target_ulong)TARGET_PAGE_MASK | ((1 << a_bits) - 1);
+    /* Mask out the page offset, except for the required alignment. */
+    compare_mask = TARGET_PAGE_MASK | a_mask;
     if (check_fit_tl(compare_mask, 13)) {
-        tcg_out_arithi(s, r2, addr, compare_mask, ARITH_AND);
+        tcg_out_arithi(s, TCG_REG_T3, addr_reg, compare_mask, ARITH_AND);
     } else {
-        tcg_out_movi(s, TCG_TYPE_TL, r2, compare_mask);
-        tcg_out_arith(s, r2, addr, r2, ARITH_AND);
+        tcg_out_movi_s32(s, TCG_REG_T3, compare_mask);
+        tcg_out_arith(s, TCG_REG_T3, addr_reg, TCG_REG_T3, ARITH_AND);
     }
-    tcg_out_cmp(s, r0, r2, 0);
+    tcg_out_cmp(s, TCG_REG_T2, TCG_REG_T3, 0);
 
-    /* If the guest address must be zero-extended, do so now.  */
+    ldst = new_ldst_label(s);
+    ldst->is_ld = is_ld;
+    ldst->oi = oi;
+    ldst->addrlo_reg = addr_reg;
+    ldst->label_ptr[0] = s->code_ptr;
+
+    /* bne,pn %[xi]cc, label0 */
+    cc = TARGET_LONG_BITS == 64 ? BPCC_XCC : BPCC_ICC;
+    tcg_out_bpcc0(s, COND_NE, BPCC_PN | cc, 0);
+#else
+    if (a_bits != s_bits) {
+        /*
+         * Test for at least natural alignment, and defer
+         * everything else to the helper functions.
+         */
+        tcg_debug_assert(check_fit_tl(a_mask, 13));
+        tcg_out_arithi(s, TCG_REG_G0, addr_reg, a_mask, ARITH_ANDCC);
+
+        ldst = new_ldst_label(s);
+        ldst->is_ld = is_ld;
+        ldst->oi = oi;
+        ldst->addrlo_reg = addr_reg;
+        ldst->label_ptr[0] = s->code_ptr;
+
+        /* bne,pn %icc, label0 */
+        tcg_out_bpcc0(s, COND_NE, BPCC_PN | BPCC_ICC, 0);
+    }
+    h->base = guest_base ? TCG_GUEST_BASE_REG : TCG_REG_G0;
+#endif
+
+    /* If the guest address must be zero-extended, do so in the delay slot.  */
     if (TARGET_LONG_BITS == 32) {
-        tcg_out_ext32u(s, r0, addr);
-        return r0;
+        tcg_out_ext32u(s, TCG_REG_T2, addr_reg);
+        h->index = TCG_REG_T2;
+    } else {
+        if (ldst) {
+            tcg_out_nop(s);
+        }
+        h->index = addr_reg;
     }
-    return addr;
+    return ldst;
 }
-#endif /* CONFIG_SOFTMMU */
-
-static const int qemu_ld_opc[(MO_SSIZE | MO_BSWAP) + 1] = {
-    [MO_UB]   = LDUB,
-    [MO_SB]   = LDSB,
-    [MO_UB | MO_LE] = LDUB,
-    [MO_SB | MO_LE] = LDSB,
-
-    [MO_BEUW] = LDUH,
-    [MO_BESW] = LDSH,
-    [MO_BEUL] = LDUW,
-    [MO_BESL] = LDSW,
-    [MO_BEUQ] = LDX,
-    [MO_BESQ] = LDX,
-
-    [MO_LEUW] = LDUH_LE,
-    [MO_LESW] = LDSH_LE,
-    [MO_LEUL] = LDUW_LE,
-    [MO_LESL] = LDSW_LE,
-    [MO_LEUQ] = LDX_LE,
-    [MO_LESQ] = LDX_LE,
-};
-
-static const int qemu_st_opc[(MO_SIZE | MO_BSWAP) + 1] = {
-    [MO_UB]   = STB,
-
-    [MO_BEUW] = STH,
-    [MO_BEUL] = STW,
-    [MO_BEUQ] = STX,
-
-    [MO_LEUW] = STH_LE,
-    [MO_LEUL] = STW_LE,
-    [MO_LEUQ] = STX_LE,
-};
 
 static void tcg_out_qemu_ld(TCGContext *s, TCGReg data, TCGReg addr,
                             MemOpIdx oi, TCGType data_type)
 {
-    MemOp memop = get_memop(oi);
-    tcg_insn_unit *label_ptr;
+    static const int ld_opc[(MO_SSIZE | MO_BSWAP) + 1] = {
+        [MO_UB]   = LDUB,
+        [MO_SB]   = LDSB,
+        [MO_UB | MO_LE] = LDUB,
+        [MO_SB | MO_LE] = LDSB,
 
-#ifdef CONFIG_SOFTMMU
-    unsigned memi = get_mmuidx(oi);
-    TCGReg addrz;
-    const tcg_insn_unit *func;
+        [MO_BEUW] = LDUH,
+        [MO_BESW] = LDSH,
+        [MO_BEUL] = LDUW,
+        [MO_BESL] = LDSW,
+        [MO_BEUQ] = LDX,
+        [MO_BESQ] = LDX,
 
-    addrz = tcg_out_tlb_load(s, addr, memi, memop,
-                             offsetof(CPUTLBEntry, addr_read));
+        [MO_LEUW] = LDUH_LE,
+        [MO_LESW] = LDSH_LE,
+        [MO_LEUL] = LDUW_LE,
+        [MO_LESL] = LDSW_LE,
+        [MO_LEUQ] = LDX_LE,
+        [MO_LESQ] = LDX_LE,
+    };
 
-    /* The fast path is exactly one insn.  Thus we can perform the
-       entire TLB Hit in the (annulled) delay slot of the branch
-       over the TLB Miss case.  */
+    TCGLabelQemuLdst *ldst;
+    HostAddress h;
 
-    /* beq,a,pt %[xi]cc, label0 */
-    label_ptr = s->code_ptr;
-    tcg_out_bpcc0(s, COND_E, BPCC_A | BPCC_PT
-                  | (TARGET_LONG_BITS == 64 ? BPCC_XCC : BPCC_ICC), 0);
-    /* delay slot */
-    tcg_out_ldst_rr(s, data, addrz, TCG_REG_O1,
-                    qemu_ld_opc[memop & (MO_BSWAP | MO_SSIZE)]);
+    ldst = prepare_host_addr(s, &h, addr, oi, true);
 
-    /* TLB Miss.  */
+    tcg_out_ldst_rr(s, data, h.base, h.index,
+                    ld_opc[get_memop(oi) & (MO_BSWAP | MO_SSIZE)]);
 
-    tcg_out_mov(s, TCG_TYPE_REG, TCG_REG_O1, addrz);
-
-    /* We use the helpers to extend SB and SW data, leaving the case
-       of SL needing explicit extending below.  */
-    if ((memop & MO_SSIZE) == MO_SL) {
-        func = qemu_ld_trampoline[MO_UL];
-    } else {
-        func = qemu_ld_trampoline[memop & MO_SSIZE];
+    if (ldst) {
+        ldst->type = data_type;
+        ldst->datalo_reg = data;
+        ldst->raddr = tcg_splitwx_to_rx(s->code_ptr);
     }
-    tcg_debug_assert(func != NULL);
-    tcg_out_call_nodelay(s, func, false);
-    /* delay slot */
-    tcg_out_movi(s, TCG_TYPE_I32, TCG_REG_O2, oi);
-
-    /* We let the helper sign-extend SB and SW, but leave SL for here.  */
-    if ((memop & MO_SSIZE) == MO_SL) {
-        tcg_out_ext32s(s, data, TCG_REG_O0);
-    } else {
-        tcg_out_mov(s, TCG_TYPE_REG, data, TCG_REG_O0);
-    }
-
-    *label_ptr |= INSN_OFF19(tcg_ptr_byte_diff(s->code_ptr, label_ptr));
-#else
-    TCGReg index = (guest_base ? TCG_GUEST_BASE_REG : TCG_REG_G0);
-    unsigned a_bits = get_alignment_bits(memop);
-    unsigned s_bits = memop & MO_SIZE;
-    unsigned t_bits;
-
-    if (TARGET_LONG_BITS == 32) {
-        tcg_out_ext32u(s, TCG_REG_T1, addr);
-        addr = TCG_REG_T1;
-    }
-
-    /*
-     * Normal case: alignment equal to access size.
-     */
-    if (a_bits == s_bits) {
-        tcg_out_ldst_rr(s, data, addr, index,
-                        qemu_ld_opc[memop & (MO_BSWAP | MO_SSIZE)]);
-        return;
-    }
-
-    /*
-     * Test for at least natural alignment, and assume most accesses
-     * will be aligned -- perform a straight load in the delay slot.
-     * This is required to preserve atomicity for aligned accesses.
-     */
-    t_bits = MAX(a_bits, s_bits);
-    tcg_debug_assert(t_bits < 13);
-    tcg_out_arithi(s, TCG_REG_G0, addr, (1u << t_bits) - 1, ARITH_ANDCC);
-
-    /* beq,a,pt %icc, label */
-    label_ptr = s->code_ptr;
-    tcg_out_bpcc0(s, COND_E, BPCC_A | BPCC_PT | BPCC_ICC, 0);
-    /* delay slot */
-    tcg_out_ldst_rr(s, data, addr, index,
-                    qemu_ld_opc[memop & (MO_BSWAP | MO_SSIZE)]);
-
-    if (a_bits >= s_bits) {
-        /*
-         * Overalignment: A successful alignment test will perform the memory
-         * operation in the delay slot, and failure need only invoke the
-         * handler for SIGBUS.
-         */
-        tcg_out_call_nodelay(s, qemu_unalign_ld_trampoline, false);
-        /* delay slot -- move to low part of argument reg */
-        tcg_out_mov_delay(s, TCG_REG_O1, addr);
-    } else {
-        /* Underalignment: load by pieces of minimum alignment. */
-        int ld_opc, a_size, s_size, i;
-
-        /*
-         * Force full address into T1 early; avoids problems with
-         * overlap between @addr and @data.
-         */
-        tcg_out_arith(s, TCG_REG_T1, addr, index, ARITH_ADD);
-
-        a_size = 1 << a_bits;
-        s_size = 1 << s_bits;
-        if ((memop & MO_BSWAP) == MO_BE) {
-            ld_opc = qemu_ld_opc[a_bits | MO_BE | (memop & MO_SIGN)];
-            tcg_out_ldst(s, data, TCG_REG_T1, 0, ld_opc);
-            ld_opc = qemu_ld_opc[a_bits | MO_BE];
-            for (i = a_size; i < s_size; i += a_size) {
-                tcg_out_ldst(s, TCG_REG_T2, TCG_REG_T1, i, ld_opc);
-                tcg_out_arithi(s, data, data, a_size, SHIFT_SLLX);
-                tcg_out_arith(s, data, data, TCG_REG_T2, ARITH_OR);
-            }
-        } else if (a_bits == 0) {
-            ld_opc = LDUB;
-            tcg_out_ldst(s, data, TCG_REG_T1, 0, ld_opc);
-            for (i = a_size; i < s_size; i += a_size) {
-                if ((memop & MO_SIGN) && i == s_size - a_size) {
-                    ld_opc = LDSB;
-                }
-                tcg_out_ldst(s, TCG_REG_T2, TCG_REG_T1, i, ld_opc);
-                tcg_out_arithi(s, TCG_REG_T2, TCG_REG_T2, i * 8, SHIFT_SLLX);
-                tcg_out_arith(s, data, data, TCG_REG_T2, ARITH_OR);
-            }
-        } else {
-            ld_opc = qemu_ld_opc[a_bits | MO_LE];
-            tcg_out_ldst_rr(s, data, TCG_REG_T1, TCG_REG_G0, ld_opc);
-            for (i = a_size; i < s_size; i += a_size) {
-                tcg_out_arithi(s, TCG_REG_T1, TCG_REG_T1, a_size, ARITH_ADD);
-                if ((memop & MO_SIGN) && i == s_size - a_size) {
-                    ld_opc = qemu_ld_opc[a_bits | MO_LE | MO_SIGN];
-                }
-                tcg_out_ldst_rr(s, TCG_REG_T2, TCG_REG_T1, TCG_REG_G0, ld_opc);
-                tcg_out_arithi(s, TCG_REG_T2, TCG_REG_T2, i * 8, SHIFT_SLLX);
-                tcg_out_arith(s, data, data, TCG_REG_T2, ARITH_OR);
-            }
-        }
-    }
-
-    *label_ptr |= INSN_OFF19(tcg_ptr_byte_diff(s->code_ptr, label_ptr));
-#endif /* CONFIG_SOFTMMU */
 }
 
 static void tcg_out_qemu_st(TCGContext *s, TCGReg data, TCGReg addr,
                             MemOpIdx oi, TCGType data_type)
 {
-    MemOp memop = get_memop(oi);
-    tcg_insn_unit *label_ptr;
+    static const int st_opc[(MO_SIZE | MO_BSWAP) + 1] = {
+        [MO_UB]   = STB,
 
-#ifdef CONFIG_SOFTMMU
-    unsigned memi = get_mmuidx(oi);
-    TCGReg addrz;
-    const tcg_insn_unit *func;
+        [MO_BEUW] = STH,
+        [MO_BEUL] = STW,
+        [MO_BEUQ] = STX,
 
-    addrz = tcg_out_tlb_load(s, addr, memi, memop,
-                             offsetof(CPUTLBEntry, addr_write));
+        [MO_LEUW] = STH_LE,
+        [MO_LEUL] = STW_LE,
+        [MO_LEUQ] = STX_LE,
+    };
 
-    /* The fast path is exactly one insn.  Thus we can perform the entire
-       TLB Hit in the (annulled) delay slot of the branch over TLB Miss.  */
-    /* beq,a,pt %[xi]cc, label0 */
-    label_ptr = s->code_ptr;
-    tcg_out_bpcc0(s, COND_E, BPCC_A | BPCC_PT
-                  | (TARGET_LONG_BITS == 64 ? BPCC_XCC : BPCC_ICC), 0);
-    /* delay slot */
-    tcg_out_ldst_rr(s, data, addrz, TCG_REG_O1,
-                    qemu_st_opc[memop & (MO_BSWAP | MO_SIZE)]);
+    TCGLabelQemuLdst *ldst;
+    HostAddress h;
 
-    /* TLB Miss.  */
+    ldst = prepare_host_addr(s, &h, addr, oi, false);
 
-    tcg_out_mov(s, TCG_TYPE_REG, TCG_REG_O1, addrz);
-    tcg_out_movext(s, (memop & MO_SIZE) == MO_64 ? TCG_TYPE_I64 : TCG_TYPE_I32,
-                   TCG_REG_O2, data_type, memop & MO_SIZE, data);
+    tcg_out_ldst_rr(s, data, h.base, h.index,
+                    st_opc[get_memop(oi) & (MO_BSWAP | MO_SIZE)]);
 
-    func = qemu_st_trampoline[memop & MO_SIZE];
-    tcg_debug_assert(func != NULL);
-    tcg_out_call_nodelay(s, func, false);
-    /* delay slot */
-    tcg_out_movi(s, TCG_TYPE_I32, TCG_REG_O3, oi);
-
-    *label_ptr |= INSN_OFF19(tcg_ptr_byte_diff(s->code_ptr, label_ptr));
-#else
-    TCGReg index = (guest_base ? TCG_GUEST_BASE_REG : TCG_REG_G0);
-    unsigned a_bits = get_alignment_bits(memop);
-    unsigned s_bits = memop & MO_SIZE;
-    unsigned t_bits;
-
-    if (TARGET_LONG_BITS == 32) {
-        tcg_out_ext32u(s, TCG_REG_T1, addr);
-        addr = TCG_REG_T1;
+    if (ldst) {
+        ldst->type = data_type;
+        ldst->datalo_reg = data;
+        ldst->raddr = tcg_splitwx_to_rx(s->code_ptr);
     }
-
-    /*
-     * Normal case: alignment equal to access size.
-     */
-    if (a_bits == s_bits) {
-        tcg_out_ldst_rr(s, data, addr, index,
-                        qemu_st_opc[memop & (MO_BSWAP | MO_SIZE)]);
-        return;
-    }
-
-    /*
-     * Test for at least natural alignment, and assume most accesses
-     * will be aligned -- perform a straight store in the delay slot.
-     * This is required to preserve atomicity for aligned accesses.
-     */
-    t_bits = MAX(a_bits, s_bits);
-    tcg_debug_assert(t_bits < 13);
-    tcg_out_arithi(s, TCG_REG_G0, addr, (1u << t_bits) - 1, ARITH_ANDCC);
-
-    /* beq,a,pt %icc, label */
-    label_ptr = s->code_ptr;
-    tcg_out_bpcc0(s, COND_E, BPCC_A | BPCC_PT | BPCC_ICC, 0);
-    /* delay slot */
-    tcg_out_ldst_rr(s, data, addr, index,
-                    qemu_st_opc[memop & (MO_BSWAP | MO_SIZE)]);
-
-    if (a_bits >= s_bits) {
-        /*
-         * Overalignment: A successful alignment test will perform the memory
-         * operation in the delay slot, and failure need only invoke the
-         * handler for SIGBUS.
-         */
-        tcg_out_call_nodelay(s, qemu_unalign_st_trampoline, false);
-        /* delay slot -- move to low part of argument reg */
-        tcg_out_mov_delay(s, TCG_REG_O1, addr);
-    } else {
-        /* Underalignment: store by pieces of minimum alignment. */
-        int st_opc, a_size, s_size, i;
-
-        /*
-         * Force full address into T1 early; avoids problems with
-         * overlap between @addr and @data.
-         */
-        tcg_out_arith(s, TCG_REG_T1, addr, index, ARITH_ADD);
-
-        a_size = 1 << a_bits;
-        s_size = 1 << s_bits;
-        if ((memop & MO_BSWAP) == MO_BE) {
-            st_opc = qemu_st_opc[a_bits | MO_BE];
-            for (i = 0; i < s_size; i += a_size) {
-                TCGReg d = data;
-                int shift = (s_size - a_size - i) * 8;
-                if (shift) {
-                    d = TCG_REG_T2;
-                    tcg_out_arithi(s, d, data, shift, SHIFT_SRLX);
-                }
-                tcg_out_ldst(s, d, TCG_REG_T1, i, st_opc);
-            }
-        } else if (a_bits == 0) {
-            tcg_out_ldst(s, data, TCG_REG_T1, 0, STB);
-            for (i = 1; i < s_size; i++) {
-                tcg_out_arithi(s, TCG_REG_T2, data, i * 8, SHIFT_SRLX);
-                tcg_out_ldst(s, TCG_REG_T2, TCG_REG_T1, i, STB);
-            }
-        } else {
-            /* Note that ST*A with immediate asi must use indexed address. */
-            st_opc = qemu_st_opc[a_bits + MO_LE];
-            tcg_out_ldst_rr(s, data, TCG_REG_T1, TCG_REG_G0, st_opc);
-            for (i = a_size; i < s_size; i += a_size) {
-                tcg_out_arithi(s, TCG_REG_T2, data, i * 8, SHIFT_SRLX);
-                tcg_out_arithi(s, TCG_REG_T1, TCG_REG_T1, a_size, ARITH_ADD);
-                tcg_out_ldst_rr(s, TCG_REG_T2, TCG_REG_T1, TCG_REG_G0, st_opc);
-            }
-        }
-    }
-
-    *label_ptr |= INSN_OFF19(tcg_ptr_byte_diff(s->code_ptr, label_ptr));
-#endif /* CONFIG_SOFTMMU */
 }
 
 static void tcg_out_exit_tb(TCGContext *s, uintptr_t a0)
@@ -1744,6 +1499,8 @@ static TCGConstraintSetIndex tcg_target_op_def(TCGOpcode op)
     case INDEX_op_extu_i32_i64:
     case INDEX_op_extrl_i64_i32:
     case INDEX_op_extrh_i64_i32:
+    case INDEX_op_qemu_ld_i32:
+    case INDEX_op_qemu_ld_i64:
         return C_O1_I1(r, r);
 
     case INDEX_op_st8_i32:
@@ -1753,6 +1510,8 @@ static TCGConstraintSetIndex tcg_target_op_def(TCGOpcode op)
     case INDEX_op_st_i32:
     case INDEX_op_st32_i64:
     case INDEX_op_st_i64:
+    case INDEX_op_qemu_st_i32:
+    case INDEX_op_qemu_st_i64:
         return C_O0_I2(rZ, r);
 
     case INDEX_op_add_i32:
@@ -1802,13 +1561,6 @@ static TCGConstraintSetIndex tcg_target_op_def(TCGOpcode op)
     case INDEX_op_muluh_i64:
         return C_O1_I2(r, r, r);
 
-    case INDEX_op_qemu_ld_i32:
-    case INDEX_op_qemu_ld_i64:
-        return C_O1_I1(r, s);
-    case INDEX_op_qemu_st_i32:
-    case INDEX_op_qemu_st_i64:
-        return C_O0_I2(sZ, s);
-
     default:
         g_assert_not_reached();
     }
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH v4 35/57] accel/tcg: Remove helper_unaligned_{ld,st}
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
                   ` (33 preceding siblings ...)
  2023-05-03  7:06 ` [PATCH v4 34/57] tcg/sparc64: Use standard slow path for softmmu Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-05 12:27   ` Peter Maydell
  2023-05-03  7:06 ` [PATCH v4 36/57] tcg/loongarch64: Assert the host supports unaligned accesses Richard Henderson
                   ` (22 subsequent siblings)
  57 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel; +Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

These functions are now unused.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 include/tcg/tcg-ldst.h |  6 ------
 accel/tcg/user-exec.c  | 10 ----------
 2 files changed, 16 deletions(-)

diff --git a/include/tcg/tcg-ldst.h b/include/tcg/tcg-ldst.h
index 64f48e6990..7dd57013e9 100644
--- a/include/tcg/tcg-ldst.h
+++ b/include/tcg/tcg-ldst.h
@@ -60,10 +60,4 @@ void helper_stq_mmu(CPUArchState *env, target_ulong addr, uint64_t val,
 void helper_st16_mmu(CPUArchState *env, target_ulong addr, Int128 val,
                      MemOpIdx oi, uintptr_t retaddr);
 
-#ifdef CONFIG_USER_ONLY
-
-G_NORETURN void helper_unaligned_ld(CPUArchState *env, target_ulong addr);
-G_NORETURN void helper_unaligned_st(CPUArchState *env, target_ulong addr);
-
-#endif /* CONFIG_USER_ONLY */
 #endif /* TCG_LDST_H */
diff --git a/accel/tcg/user-exec.c b/accel/tcg/user-exec.c
index 8f86254eb4..7b824dcde8 100644
--- a/accel/tcg/user-exec.c
+++ b/accel/tcg/user-exec.c
@@ -889,16 +889,6 @@ void page_reset_target_data(target_ulong start, target_ulong last) { }
 
 /* The softmmu versions of these helpers are in cputlb.c.  */
 
-void helper_unaligned_ld(CPUArchState *env, target_ulong addr)
-{
-    cpu_loop_exit_sigbus(env_cpu(env), addr, MMU_DATA_LOAD, GETPC());
-}
-
-void helper_unaligned_st(CPUArchState *env, target_ulong addr)
-{
-    cpu_loop_exit_sigbus(env_cpu(env), addr, MMU_DATA_STORE, GETPC());
-}
-
 static void *cpu_mmu_lookup(CPUArchState *env, abi_ptr addr,
                             MemOp mop, uintptr_t ra, MMUAccessType type)
 {
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH v4 36/57] tcg/loongarch64: Assert the host supports unaligned accesses
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
                   ` (34 preceding siblings ...)
  2023-05-03  7:06 ` [PATCH v4 35/57] accel/tcg: Remove helper_unaligned_{ld,st} Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-05 12:30   ` Peter Maydell
  2023-05-05 13:24   ` WANG Xuerui
  2023-05-03  7:06 ` [PATCH v4 37/57] tcg/loongarch64: Support softmmu " Richard Henderson
                   ` (21 subsequent siblings)
  57 siblings, 2 replies; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel; +Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

This should be true of all server-class loongarch64 processors.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/loongarch64/tcg-target.c.inc | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/tcg/loongarch64/tcg-target.c.inc b/tcg/loongarch64/tcg-target.c.inc
index e651ec5c71..ccc13ffdb4 100644
--- a/tcg/loongarch64/tcg-target.c.inc
+++ b/tcg/loongarch64/tcg-target.c.inc
@@ -30,6 +30,7 @@
  */
 
 #include "../tcg-ldst.c.inc"
+#include <asm/hwcap.h>
 
 #ifdef CONFIG_DEBUG_TCG
 static const char * const tcg_target_reg_names[TCG_TARGET_NB_REGS] = {
@@ -1674,6 +1675,11 @@ static void tcg_target_qemu_prologue(TCGContext *s)
 
 static void tcg_target_init(TCGContext *s)
 {
+    unsigned long hwcap = qemu_getauxval(AT_HWCAP);
+
+    /* All server class loongarch have UAL; only embedded do not. */
+    assert(hwcap & HWCAP_LOONGARCH_UAL);
+
     tcg_target_available_regs[TCG_TYPE_I32] = ALL_GENERAL_REGS;
     tcg_target_available_regs[TCG_TYPE_I64] = ALL_GENERAL_REGS;
 
-- 
2.34.1
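
For reference, the same AT_HWCAP probe can be made from an ordinary
userspace program.  A minimal sketch (not part of the patch), assuming a
loongarch64 Linux host whose <asm/hwcap.h> provides HWCAP_LOONGARCH_UAL
and whose libc provides getauxval():

    #include <asm/hwcap.h>      /* HWCAP_LOONGARCH_UAL */
    #include <stdio.h>
    #include <sys/auxv.h>       /* getauxval, AT_HWCAP */

    int main(void)
    {
        unsigned long hwcap = getauxval(AT_HWCAP);

        /* UAL set: the core handles unaligned loads/stores in hardware. */
        if (hwcap & HWCAP_LOONGARCH_UAL) {
            puts("unaligned accesses supported");
        } else {
            puts("unaligned accesses not supported");
        }
        return 0;
    }

The patch turns the same check into a hard assert at tcg_target_init()
time, since the backend is about to rely on the host handling unaligned
data.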



^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH v4 37/57] tcg/loongarch64: Support softmmu unaligned accesses
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
                   ` (35 preceding siblings ...)
  2023-05-03  7:06 ` [PATCH v4 36/57] tcg/loongarch64: Assert the host supports unaligned accesses Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-05 12:35   ` Peter Maydell
  2023-05-03  7:06 ` [PATCH v4 38/57] tcg/riscv: " Richard Henderson
                   ` (20 subsequent siblings)
  57 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel; +Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

Test the final byte of an unaligned access.
Use BSTRINS.D to clear the range of bits, rather than AND.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/loongarch64/tcg-target.c.inc | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/tcg/loongarch64/tcg-target.c.inc b/tcg/loongarch64/tcg-target.c.inc
index ccc13ffdb4..20cb21b264 100644
--- a/tcg/loongarch64/tcg-target.c.inc
+++ b/tcg/loongarch64/tcg-target.c.inc
@@ -848,7 +848,6 @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
     int fast_ofs = TLB_MASK_TABLE_OFS(mem_index);
     int mask_ofs = fast_ofs + offsetof(CPUTLBDescFast, mask);
     int table_ofs = fast_ofs + offsetof(CPUTLBDescFast, table);
-    tcg_target_long compare_mask;
 
     ldst = new_ldst_label(s);
     ldst->is_ld = is_ld;
@@ -872,14 +871,20 @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
     tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_TMP2, TCG_REG_TMP2,
                offsetof(CPUTLBEntry, addend));
 
-    /* We don't support unaligned accesses.  */
+    /*
+     * For aligned accesses, we check the first byte and include the alignment
+     * bits within the address.  For unaligned access, we check that we don't
+     * cross pages using the address of the last byte of the access.
+     */
     if (a_bits < s_bits) {
-        a_bits = s_bits;
+        unsigned a_mask = (1u << a_bits) - 1;
+        unsigned s_mask = (1u << s_bits) - 1;
+        tcg_out_addi(s, TCG_TYPE_TL, TCG_REG_TMP1, addr_reg, s_mask - a_mask);
+    } else {
+        tcg_out_mov(s, TCG_TYPE_TL, TCG_REG_TMP1, addr_reg);
     }
-    /* Clear the non-page, non-alignment bits from the address.  */
-    compare_mask = (tcg_target_long)TARGET_PAGE_MASK | ((1 << a_bits) - 1);
-    tcg_out_movi(s, TCG_TYPE_TL, TCG_REG_TMP1, compare_mask);
-    tcg_out_opc_and(s, TCG_REG_TMP1, TCG_REG_TMP1, addr_reg);
+    tcg_out_opc_bstrins_d(s, TCG_REG_TMP1, TCG_REG_ZERO,
+                          a_bits, TARGET_PAGE_BITS - 1);
 
     /* Compare masked address with the TLB entry.  */
     ldst->label_ptr[0] = s->code_ptr;
-- 
2.34.1
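
A minimal sketch of the resulting compare value (plain C, not QEMU code,
assuming 4KiB pages; a_bits is the required alignment, s_bits the log2
of the access size), showing how checking the last byte sends
page-crossing accesses to the slow path while a same-page access still
hits the fast path:

    #include <stdint.h>
    #include <stdio.h>

    #define PAGE_BITS 12
    #define PAGE_MASK (~(uint64_t)0 << PAGE_BITS)

    static uint64_t tlb_compare_value(uint64_t addr, unsigned a_bits,
                                      unsigned s_bits)
    {
        uint64_t a_mask = (1u << a_bits) - 1;
        uint64_t s_mask = (1u << s_bits) - 1;

        if (a_bits < s_bits) {
            addr += s_mask - a_mask;    /* address of the last byte */
        }
        /* BSTRINS.D rd, $zero clears bits [PAGE_BITS-1 : a_bits]. */
        return addr & (PAGE_MASK | a_mask);
    }

    int main(void)
    {
        /* 8-byte access (s_bits = 3), no alignment required (a_bits = 0). */
        printf("0x%llx\n",
               (unsigned long long)tlb_compare_value(0x1ff8, 0, 3));
        /* 0x1000: bytes 0x1ff8..0x1fff stay in one page, compare can hit. */
        printf("0x%llx\n",
               (unsigned long long)tlb_compare_value(0x1ffc, 0, 3));
        /* 0x2000: the access spills into the next page, compare misses. */
        return 0;
    }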



^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH v4 38/57] tcg/riscv: Support softmmu unaligned accesses
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
                   ` (36 preceding siblings ...)
  2023-05-03  7:06 ` [PATCH v4 37/57] tcg/loongarch64: Support softmmu " Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-05 10:35   ` LIU Zhiwei
  2023-05-03  7:06 ` [PATCH v4 39/57] tcg: Introduce tcg_target_has_memory_bswap Richard Henderson
                   ` (19 subsequent siblings)
  57 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel; +Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

The system is required to emulate unaligned accesses, even if the
hardware does not support them.  The resulting trap may or may not
be more efficient than the qemu slow path.  There are Linux kernel
patches in flight to allow userspace to query hardware support;
we can re-evaluate whether to enable this by default once they land.

In the meantime, softmmu now matches useronly, where we already
assumed that unaligned accesses are supported.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/riscv/tcg-target.c.inc | 48 ++++++++++++++++++++++----------------
 1 file changed, 28 insertions(+), 20 deletions(-)

diff --git a/tcg/riscv/tcg-target.c.inc b/tcg/riscv/tcg-target.c.inc
index 19cd4507fb..415e6c6e15 100644
--- a/tcg/riscv/tcg-target.c.inc
+++ b/tcg/riscv/tcg-target.c.inc
@@ -910,12 +910,13 @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, TCGReg *pbase,
 
 #ifdef CONFIG_SOFTMMU
     unsigned s_bits = opc & MO_SIZE;
+    unsigned s_mask = (1u << s_bits) - 1;
     int mem_index = get_mmuidx(oi);
     int fast_ofs = TLB_MASK_TABLE_OFS(mem_index);
     int mask_ofs = fast_ofs + offsetof(CPUTLBDescFast, mask);
     int table_ofs = fast_ofs + offsetof(CPUTLBDescFast, table);
-    TCGReg mask_base = TCG_AREG0, table_base = TCG_AREG0;
-    tcg_target_long compare_mask;
+    int compare_mask;
+    TCGReg addr_adj;
 
     ldst = new_ldst_label(s);
     ldst->is_ld = is_ld;
@@ -924,14 +925,33 @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, TCGReg *pbase,
 
     QEMU_BUILD_BUG_ON(TLB_MASK_TABLE_OFS(0) > 0);
     QEMU_BUILD_BUG_ON(TLB_MASK_TABLE_OFS(0) < -(1 << 11));
-    tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_TMP0, mask_base, mask_ofs);
-    tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_TMP1, table_base, table_ofs);
+    tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_TMP0, TCG_AREG0, mask_ofs);
+    tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_TMP1, TCG_AREG0, table_ofs);
 
     tcg_out_opc_imm(s, OPC_SRLI, TCG_REG_TMP2, addr_reg,
                     TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS);
     tcg_out_opc_reg(s, OPC_AND, TCG_REG_TMP2, TCG_REG_TMP2, TCG_REG_TMP0);
     tcg_out_opc_reg(s, OPC_ADD, TCG_REG_TMP2, TCG_REG_TMP2, TCG_REG_TMP1);
 
+    /*
+     * For aligned accesses, we check the first byte and include the alignment
+     * bits within the address.  For unaligned access, we check that we don't
+     * cross pages using the address of the last byte of the access.
+     */
+    addr_adj = addr_reg;
+    if (a_bits < s_bits) {
+        addr_adj = TCG_REG_TMP0;
+        tcg_out_opc_imm(s, TARGET_LONG_BITS == 32 ? OPC_ADDIW : OPC_ADDI,
+                        addr_adj, addr_reg, s_mask - a_mask);
+    }
+    compare_mask = TARGET_PAGE_MASK | a_mask;
+    if (compare_mask == sextreg(compare_mask, 0, 12)) {
+        tcg_out_opc_imm(s, OPC_ANDI, TCG_REG_TMP1, addr_adj, compare_mask);
+    } else {
+        tcg_out_movi(s, TCG_TYPE_TL, TCG_REG_TMP1, compare_mask);
+        tcg_out_opc_reg(s, OPC_AND, TCG_REG_TMP1, TCG_REG_TMP1, addr_adj);
+    }
+
     /* Load the tlb comparator and the addend.  */
     tcg_out_ld(s, TCG_TYPE_TL, TCG_REG_TMP0, TCG_REG_TMP2,
                is_ld ? offsetof(CPUTLBEntry, addr_read)
@@ -939,29 +959,17 @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, TCGReg *pbase,
     tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_TMP2, TCG_REG_TMP2,
                offsetof(CPUTLBEntry, addend));
 
-    /* We don't support unaligned accesses. */
-    if (a_bits < s_bits) {
-        a_bits = s_bits;
-    }
-    /* Clear the non-page, non-alignment bits from the address.  */
-    compare_mask = (tcg_target_long)TARGET_PAGE_MASK | a_mask;
-    if (compare_mask == sextreg(compare_mask, 0, 12)) {
-        tcg_out_opc_imm(s, OPC_ANDI, TCG_REG_TMP1, addr_reg, compare_mask);
-    } else {
-        tcg_out_movi(s, TCG_TYPE_TL, TCG_REG_TMP1, compare_mask);
-        tcg_out_opc_reg(s, OPC_AND, TCG_REG_TMP1, TCG_REG_TMP1, addr_reg);
-    }
-
     /* Compare masked address with the TLB entry. */
     ldst->label_ptr[0] = s->code_ptr;
     tcg_out_opc_branch(s, OPC_BNE, TCG_REG_TMP0, TCG_REG_TMP1, 0);
 
     /* TLB Hit - translate address using addend.  */
+    addr_adj = addr_reg;
     if (TARGET_LONG_BITS == 32) {
-        tcg_out_ext32u(s, TCG_REG_TMP0, addr_reg);
-        addr_reg = TCG_REG_TMP0;
+        addr_adj = TCG_REG_TMP0;
+        tcg_out_ext32u(s, addr_adj, addr_reg);
     }
-    tcg_out_opc_reg(s, OPC_ADD, TCG_REG_TMP0, TCG_REG_TMP2, addr_reg);
+    tcg_out_opc_reg(s, OPC_ADD, TCG_REG_TMP0, TCG_REG_TMP2, addr_adj);
     *pbase = TCG_REG_TMP0;
 #else
     if (a_mask) {
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH v4 39/57] tcg: Introduce tcg_target_has_memory_bswap
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
                   ` (37 preceding siblings ...)
  2023-05-03  7:06 ` [PATCH v4 38/57] tcg/riscv: " Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-05 12:41   ` Peter Maydell
  2023-05-03  7:06 ` [PATCH v4 40/57] tcg: Add INDEX_op_qemu_{ld,st}_i128 Richard Henderson
                   ` (18 subsequent siblings)
  57 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel; +Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

Replace the unparameterized TCG_TARGET_HAS_MEMORY_BSWAP macro
with a function that takes a MemOp argument.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/aarch64/tcg-target.h         |  1 -
 tcg/arm/tcg-target.h             |  1 -
 tcg/i386/tcg-target.h            |  3 ---
 tcg/loongarch64/tcg-target.h     |  2 --
 tcg/mips/tcg-target.h            |  2 --
 tcg/ppc/tcg-target.h             |  1 -
 tcg/riscv/tcg-target.h           |  2 --
 tcg/s390x/tcg-target.h           |  2 --
 tcg/sparc64/tcg-target.h         |  1 -
 tcg/tcg-internal.h               |  2 ++
 tcg/tci/tcg-target.h             |  2 --
 tcg/tcg-op.c                     | 20 +++++++++++---------
 tcg/aarch64/tcg-target.c.inc     |  5 +++++
 tcg/arm/tcg-target.c.inc         |  5 +++++
 tcg/i386/tcg-target.c.inc        |  5 +++++
 tcg/loongarch64/tcg-target.c.inc |  5 +++++
 tcg/mips/tcg-target.c.inc        |  5 +++++
 tcg/ppc/tcg-target.c.inc         |  5 +++++
 tcg/riscv/tcg-target.c.inc       |  5 +++++
 tcg/s390x/tcg-target.c.inc       |  5 +++++
 tcg/sparc64/tcg-target.c.inc     |  5 +++++
 tcg/tci/tcg-target.c.inc         |  5 +++++
 22 files changed, 63 insertions(+), 26 deletions(-)

diff --git a/tcg/aarch64/tcg-target.h b/tcg/aarch64/tcg-target.h
index 3c0b0d312d..378e01d9d8 100644
--- a/tcg/aarch64/tcg-target.h
+++ b/tcg/aarch64/tcg-target.h
@@ -154,7 +154,6 @@ extern bool have_lse2;
 #define TCG_TARGET_HAS_cmpsel_vec       0
 
 #define TCG_TARGET_DEFAULT_MO (0)
-#define TCG_TARGET_HAS_MEMORY_BSWAP     0
 #define TCG_TARGET_NEED_LDST_LABELS
 #define TCG_TARGET_NEED_POOL_LABELS
 
diff --git a/tcg/arm/tcg-target.h b/tcg/arm/tcg-target.h
index def2a189e6..4c2d3332d5 100644
--- a/tcg/arm/tcg-target.h
+++ b/tcg/arm/tcg-target.h
@@ -150,7 +150,6 @@ extern bool use_neon_instructions;
 #define TCG_TARGET_HAS_cmpsel_vec       0
 
 #define TCG_TARGET_DEFAULT_MO (0)
-#define TCG_TARGET_HAS_MEMORY_BSWAP     0
 #define TCG_TARGET_NEED_LDST_LABELS
 #define TCG_TARGET_NEED_POOL_LABELS
 
diff --git a/tcg/i386/tcg-target.h b/tcg/i386/tcg-target.h
index 0421776cb8..8fe6958abd 100644
--- a/tcg/i386/tcg-target.h
+++ b/tcg/i386/tcg-target.h
@@ -240,9 +240,6 @@ extern bool have_atomic16;
 #include "tcg/tcg-mo.h"
 
 #define TCG_TARGET_DEFAULT_MO (TCG_MO_ALL & ~TCG_MO_ST_LD)
-
-#define TCG_TARGET_HAS_MEMORY_BSWAP  have_movbe
-
 #define TCG_TARGET_NEED_LDST_LABELS
 #define TCG_TARGET_NEED_POOL_LABELS
 
diff --git a/tcg/loongarch64/tcg-target.h b/tcg/loongarch64/tcg-target.h
index 17b8193aa5..75c3d80ed2 100644
--- a/tcg/loongarch64/tcg-target.h
+++ b/tcg/loongarch64/tcg-target.h
@@ -173,6 +173,4 @@ typedef enum {
 
 #define TCG_TARGET_NEED_LDST_LABELS
 
-#define TCG_TARGET_HAS_MEMORY_BSWAP 0
-
 #endif /* LOONGARCH_TCG_TARGET_H */
diff --git a/tcg/mips/tcg-target.h b/tcg/mips/tcg-target.h
index 42bd7fff01..47088af9cb 100644
--- a/tcg/mips/tcg-target.h
+++ b/tcg/mips/tcg-target.h
@@ -205,8 +205,6 @@ extern bool use_mips32r2_instructions;
 #endif
 
 #define TCG_TARGET_DEFAULT_MO           0
-#define TCG_TARGET_HAS_MEMORY_BSWAP     0
-
 #define TCG_TARGET_NEED_LDST_LABELS
 
 #endif
diff --git a/tcg/ppc/tcg-target.h b/tcg/ppc/tcg-target.h
index af81c5a57f..d55f0266bb 100644
--- a/tcg/ppc/tcg-target.h
+++ b/tcg/ppc/tcg-target.h
@@ -179,7 +179,6 @@ extern bool have_vsx;
 #define TCG_TARGET_HAS_cmpsel_vec       0
 
 #define TCG_TARGET_DEFAULT_MO (0)
-#define TCG_TARGET_HAS_MEMORY_BSWAP     1
 #define TCG_TARGET_NEED_LDST_LABELS
 #define TCG_TARGET_NEED_POOL_LABELS
 
diff --git a/tcg/riscv/tcg-target.h b/tcg/riscv/tcg-target.h
index dddf2486c1..dece3b3c27 100644
--- a/tcg/riscv/tcg-target.h
+++ b/tcg/riscv/tcg-target.h
@@ -168,6 +168,4 @@ typedef enum {
 #define TCG_TARGET_NEED_LDST_LABELS
 #define TCG_TARGET_NEED_POOL_LABELS
 
-#define TCG_TARGET_HAS_MEMORY_BSWAP 0
-
 #endif
diff --git a/tcg/s390x/tcg-target.h b/tcg/s390x/tcg-target.h
index a05b473117..fe05680124 100644
--- a/tcg/s390x/tcg-target.h
+++ b/tcg/s390x/tcg-target.h
@@ -172,8 +172,6 @@ extern uint64_t s390_facilities[3];
 #define TCG_TARGET_CALL_ARG_I128        TCG_CALL_ARG_BY_REF
 #define TCG_TARGET_CALL_RET_I128        TCG_CALL_RET_BY_REF
 
-#define TCG_TARGET_HAS_MEMORY_BSWAP   1
-
 #define TCG_TARGET_DEFAULT_MO (TCG_MO_ALL & ~TCG_MO_ST_LD)
 #define TCG_TARGET_NEED_LDST_LABELS
 #define TCG_TARGET_NEED_POOL_LABELS
diff --git a/tcg/sparc64/tcg-target.h b/tcg/sparc64/tcg-target.h
index 7434cc99d4..f6cd86975a 100644
--- a/tcg/sparc64/tcg-target.h
+++ b/tcg/sparc64/tcg-target.h
@@ -154,7 +154,6 @@ extern bool use_vis3_instructions;
 #define TCG_AREG0 TCG_REG_I0
 
 #define TCG_TARGET_DEFAULT_MO (0)
-#define TCG_TARGET_HAS_MEMORY_BSWAP     1
 #define TCG_TARGET_NEED_LDST_LABELS
 #define TCG_TARGET_NEED_POOL_LABELS
 
diff --git a/tcg/tcg-internal.h b/tcg/tcg-internal.h
index 0f1ba01a9a..67b698bd5c 100644
--- a/tcg/tcg-internal.h
+++ b/tcg/tcg-internal.h
@@ -126,4 +126,6 @@ static inline TCGv_i64 TCGV128_HIGH(TCGv_i128 t)
     return temp_tcgv_i64(tcgv_i128_temp(t) + o);
 }
 
+bool tcg_target_has_memory_bswap(MemOp memop);
+
 #endif /* TCG_INTERNAL_H */
diff --git a/tcg/tci/tcg-target.h b/tcg/tci/tcg-target.h
index 7140a76a73..364012e4d2 100644
--- a/tcg/tci/tcg-target.h
+++ b/tcg/tci/tcg-target.h
@@ -176,6 +176,4 @@ typedef enum {
    We prefer consistency across hosts on this.  */
 #define TCG_TARGET_DEFAULT_MO  (0)
 
-#define TCG_TARGET_HAS_MEMORY_BSWAP     1
-
 #endif /* TCG_TARGET_H */
diff --git a/tcg/tcg-op.c b/tcg/tcg-op.c
index 9101d334b6..85f22458c9 100644
--- a/tcg/tcg-op.c
+++ b/tcg/tcg-op.c
@@ -2959,7 +2959,7 @@ void tcg_gen_qemu_ld_i32(TCGv_i32 val, TCGv addr, TCGArg idx, MemOp memop)
     oi = make_memop_idx(memop, idx);
 
     orig_memop = memop;
-    if (!TCG_TARGET_HAS_MEMORY_BSWAP && (memop & MO_BSWAP)) {
+    if ((memop & MO_BSWAP) && !tcg_target_has_memory_bswap(memop)) {
         memop &= ~MO_BSWAP;
         /* The bswap primitive benefits from zero-extended input.  */
         if ((memop & MO_SSIZE) == MO_SW) {
@@ -2996,7 +2996,7 @@ void tcg_gen_qemu_st_i32(TCGv_i32 val, TCGv addr, TCGArg idx, MemOp memop)
     memop = tcg_canonicalize_memop(memop, 0, 1);
     oi = make_memop_idx(memop, idx);
 
-    if (!TCG_TARGET_HAS_MEMORY_BSWAP && (memop & MO_BSWAP)) {
+    if ((memop & MO_BSWAP) && !tcg_target_has_memory_bswap(memop)) {
         swap = tcg_temp_ebb_new_i32();
         switch (memop & MO_SIZE) {
         case MO_16:
@@ -3045,7 +3045,7 @@ void tcg_gen_qemu_ld_i64(TCGv_i64 val, TCGv addr, TCGArg idx, MemOp memop)
     oi = make_memop_idx(memop, idx);
 
     orig_memop = memop;
-    if (!TCG_TARGET_HAS_MEMORY_BSWAP && (memop & MO_BSWAP)) {
+    if ((memop & MO_BSWAP) && !tcg_target_has_memory_bswap(memop)) {
         memop &= ~MO_BSWAP;
         /* The bswap primitive benefits from zero-extended input.  */
         if ((memop & MO_SIGN) && (memop & MO_SIZE) < MO_64) {
@@ -3091,7 +3091,7 @@ void tcg_gen_qemu_st_i64(TCGv_i64 val, TCGv addr, TCGArg idx, MemOp memop)
     memop = tcg_canonicalize_memop(memop, 1, 1);
     oi = make_memop_idx(memop, idx);
 
-    if (!TCG_TARGET_HAS_MEMORY_BSWAP && (memop & MO_BSWAP)) {
+    if ((memop & MO_BSWAP) && !tcg_target_has_memory_bswap(memop)) {
         swap = tcg_temp_ebb_new_i64();
         switch (memop & MO_SIZE) {
         case MO_16:
@@ -3168,11 +3168,6 @@ static void canonicalize_memop_i128_as_i64(MemOp ret[2], MemOp orig)
     tcg_debug_assert((orig & MO_SIZE) == MO_128);
     tcg_debug_assert((orig & MO_SIGN) == 0);
 
-    /* Use a memory ordering implemented by the host. */
-    if (!TCG_TARGET_HAS_MEMORY_BSWAP && (orig & MO_BSWAP)) {
-        mop_1 &= ~MO_BSWAP;
-    }
-
     /* Reduce the size to 64-bit. */
     mop_1 = (mop_1 & ~MO_SIZE) | MO_64;
 
@@ -3202,6 +3197,13 @@ static void canonicalize_memop_i128_as_i64(MemOp ret[2], MemOp orig)
     default:
         g_assert_not_reached();
     }
+
+    /* Use a memory ordering implemented by the host. */
+    if ((orig & MO_BSWAP) && !tcg_target_has_memory_bswap(mop_1)) {
+        mop_1 &= ~MO_BSWAP;
+        mop_2 &= ~MO_BSWAP;
+    }
+
     ret[0] = mop_1;
     ret[1] = mop_2;
 }
diff --git a/tcg/aarch64/tcg-target.c.inc b/tcg/aarch64/tcg-target.c.inc
index 09c9ecad0f..8e5f3d3688 100644
--- a/tcg/aarch64/tcg-target.c.inc
+++ b/tcg/aarch64/tcg-target.c.inc
@@ -1595,6 +1595,11 @@ typedef struct {
     TCGType index_ext;
 } HostAddress;
 
+bool tcg_target_has_memory_bswap(MemOp memop)
+{
+    return false;
+}
+
 static const TCGLdstHelperParam ldst_helper_param = {
     .ntmp = 1, .tmp = { TCG_REG_TMP }
 };
diff --git a/tcg/arm/tcg-target.c.inc b/tcg/arm/tcg-target.c.inc
index eb0542f32e..e5aed03247 100644
--- a/tcg/arm/tcg-target.c.inc
+++ b/tcg/arm/tcg-target.c.inc
@@ -1325,6 +1325,11 @@ typedef struct {
     bool index_scratch;
 } HostAddress;
 
+bool tcg_target_has_memory_bswap(MemOp memop)
+{
+    return false;
+}
+
 static TCGReg ldst_ra_gen(TCGContext *s, const TCGLabelQemuLdst *l, int arg)
 {
     /* We arrive at the slow path via "BLNE", so R14 contains l->raddr. */
diff --git a/tcg/i386/tcg-target.c.inc b/tcg/i386/tcg-target.c.inc
index e78d4d4aa7..7c72bf6684 100644
--- a/tcg/i386/tcg-target.c.inc
+++ b/tcg/i386/tcg-target.c.inc
@@ -1776,6 +1776,11 @@ typedef struct {
     int seg;
 } HostAddress;
 
+bool tcg_target_has_memory_bswap(MemOp memop)
+{
+    return have_movbe;
+}
+
 /*
  * Because i686 has no register parameters and because x86_64 has xchg
  * to handle addr/data register overlap, we have placed all input arguments
diff --git a/tcg/loongarch64/tcg-target.c.inc b/tcg/loongarch64/tcg-target.c.inc
index 20cb21b264..62bf823084 100644
--- a/tcg/loongarch64/tcg-target.c.inc
+++ b/tcg/loongarch64/tcg-target.c.inc
@@ -828,6 +828,11 @@ typedef struct {
     TCGReg index;
 } HostAddress;
 
+bool tcg_target_has_memory_bswap(MemOp memop)
+{
+    return false;
+}
+
 /*
  * For softmmu, perform the TLB load and compare.
  * For useronly, perform any required alignment tests.
diff --git a/tcg/mips/tcg-target.c.inc b/tcg/mips/tcg-target.c.inc
index fa0f334e8d..cd0254a0d7 100644
--- a/tcg/mips/tcg-target.c.inc
+++ b/tcg/mips/tcg-target.c.inc
@@ -1141,6 +1141,11 @@ typedef struct {
     MemOp align;
 } HostAddress;
 
+bool tcg_target_has_memory_bswap(MemOp memop)
+{
+    return false;
+}
+
 /*
  * For softmmu, perform the TLB load and compare.
  * For useronly, perform any required alignment tests.
diff --git a/tcg/ppc/tcg-target.c.inc b/tcg/ppc/tcg-target.c.inc
index 733f67c7a5..f0a4118bbb 100644
--- a/tcg/ppc/tcg-target.c.inc
+++ b/tcg/ppc/tcg-target.c.inc
@@ -2017,6 +2017,11 @@ typedef struct {
     TCGReg index;
 } HostAddress;
 
+bool tcg_target_has_memory_bswap(MemOp memop)
+{
+    return true;
+}
+
 /*
  * For softmmu, perform the TLB load and compare.
  * For useronly, perform any required alignment tests.
diff --git a/tcg/riscv/tcg-target.c.inc b/tcg/riscv/tcg-target.c.inc
index 415e6c6e15..37870c89fc 100644
--- a/tcg/riscv/tcg-target.c.inc
+++ b/tcg/riscv/tcg-target.c.inc
@@ -853,6 +853,11 @@ static void tcg_out_goto(TCGContext *s, const tcg_insn_unit *target)
     tcg_debug_assert(ok);
 }
 
+bool tcg_target_has_memory_bswap(MemOp memop)
+{
+    return false;
+}
+
 /* We have three temps, we might as well expose them. */
 static const TCGLdstHelperParam ldst_helper_param = {
     .ntmp = 3, .tmp = { TCG_REG_TMP0, TCG_REG_TMP1, TCG_REG_TMP2 }
diff --git a/tcg/s390x/tcg-target.c.inc b/tcg/s390x/tcg-target.c.inc
index de8aed5f77..22f0206b5a 100644
--- a/tcg/s390x/tcg-target.c.inc
+++ b/tcg/s390x/tcg-target.c.inc
@@ -1574,6 +1574,11 @@ typedef struct {
     int disp;
 } HostAddress;
 
+bool tcg_target_has_memory_bswap(MemOp memop)
+{
+    return true;
+}
+
 static void tcg_out_qemu_ld_direct(TCGContext *s, MemOp opc, TCGReg data,
                                    HostAddress h)
 {
diff --git a/tcg/sparc64/tcg-target.c.inc b/tcg/sparc64/tcg-target.c.inc
index 0237188d65..bb23038529 100644
--- a/tcg/sparc64/tcg-target.c.inc
+++ b/tcg/sparc64/tcg-target.c.inc
@@ -1011,6 +1011,11 @@ typedef struct {
     TCGReg index;
 } HostAddress;
 
+bool tcg_target_has_memory_bswap(MemOp memop)
+{
+    return true;
+}
+
 /*
  * For softmmu, perform the TLB load and compare.
  * For useronly, perform any required alignment tests.
diff --git a/tcg/tci/tcg-target.c.inc b/tcg/tci/tcg-target.c.inc
index e31640d109..89f693050c 100644
--- a/tcg/tci/tcg-target.c.inc
+++ b/tcg/tci/tcg-target.c.inc
@@ -964,3 +964,8 @@ static void tcg_target_init(TCGContext *s)
 static inline void tcg_target_qemu_prologue(TCGContext *s)
 {
 }
+
+bool tcg_target_has_memory_bswap(MemOp memop)
+{
+    return true;
+}
-- 
2.34.1
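
The point of taking a MemOp is that byte-swapping support can depend on
the operation, not just on the host.  A hypothetical backend
implementation (not one of the real ones above; assumes the usual
MemOp/MO_SIZE definitions from "exec/memop.h") might accept swapped
16/32/64-bit operations but decline a 128-bit one:

    bool tcg_target_has_memory_bswap(MemOp memop)
    {
        switch (memop & MO_SIZE) {
        case MO_16:
        case MO_32:
        case MO_64:
            return true;    /* e.g. a MOVBE-like load/store exists */
        default:
            return false;   /* MO_8 needs no swap; MO_128 falls back */
        }
    }

Generic code in tcg-op.c then inserts explicit bswap ops only for the
cases the backend declines, as in the hunks above.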



^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH v4 40/57] tcg: Add INDEX_op_qemu_{ld,st}_i128
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
                   ` (38 preceding siblings ...)
  2023-05-03  7:06 ` [PATCH v4 39/57] tcg: Introduce tcg_target_has_memory_bswap Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-05 12:45   ` Peter Maydell
  2023-05-03  7:06 ` [PATCH v4 41/57] tcg: Support TCG_TYPE_I128 in tcg_out_{ld, st}_helper_{args, ret} Richard Henderson
                   ` (17 subsequent siblings)
  57 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel; +Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

Add opcodes so that backends can support 128-bit memory operations.

Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 include/tcg/tcg-opc.h        |  8 +++++
 tcg/aarch64/tcg-target.h     |  2 ++
 tcg/arm/tcg-target.h         |  2 ++
 tcg/i386/tcg-target.h        |  2 ++
 tcg/loongarch64/tcg-target.h |  1 +
 tcg/mips/tcg-target.h        |  2 ++
 tcg/ppc/tcg-target.h         |  2 ++
 tcg/riscv/tcg-target.h       |  2 ++
 tcg/s390x/tcg-target.h       |  2 ++
 tcg/sparc64/tcg-target.h     |  2 ++
 tcg/tci/tcg-target.h         |  2 ++
 tcg/tcg-op.c                 | 69 ++++++++++++++++++++++++++++++++----
 tcg/tcg.c                    |  4 +++
 docs/devel/tcg-ops.rst       | 11 +++---
 14 files changed, 101 insertions(+), 10 deletions(-)

diff --git a/include/tcg/tcg-opc.h b/include/tcg/tcg-opc.h
index dd444734d9..94cf7c5d6a 100644
--- a/include/tcg/tcg-opc.h
+++ b/include/tcg/tcg-opc.h
@@ -213,6 +213,14 @@ DEF(qemu_st8_i32, 0, TLADDR_ARGS + 1, 1,
     TCG_OPF_CALL_CLOBBER | TCG_OPF_SIDE_EFFECTS |
     IMPL(TCG_TARGET_HAS_qemu_st8_i32))
 
+/* Only for 64-bit hosts at the moment. */
+DEF(qemu_ld_i128, 2, 1, 1,
+    TCG_OPF_CALL_CLOBBER | TCG_OPF_SIDE_EFFECTS | TCG_OPF_64BIT |
+    IMPL(TCG_TARGET_HAS_qemu_ldst_i128))
+DEF(qemu_st_i128, 0, 3, 1,
+    TCG_OPF_CALL_CLOBBER | TCG_OPF_SIDE_EFFECTS | TCG_OPF_64BIT |
+    IMPL(TCG_TARGET_HAS_qemu_ldst_i128))
+
 /* Host vector support.  */
 
 #define IMPLVEC  TCG_OPF_VECTOR | IMPL(TCG_TARGET_MAYBE_vec)
diff --git a/tcg/aarch64/tcg-target.h b/tcg/aarch64/tcg-target.h
index 378e01d9d8..74ee2ed255 100644
--- a/tcg/aarch64/tcg-target.h
+++ b/tcg/aarch64/tcg-target.h
@@ -129,6 +129,8 @@ extern bool have_lse2;
 #define TCG_TARGET_HAS_muluh_i64        1
 #define TCG_TARGET_HAS_mulsh_i64        1
 
+#define TCG_TARGET_HAS_qemu_ldst_i128   0
+
 #define TCG_TARGET_HAS_v64              1
 #define TCG_TARGET_HAS_v128             1
 #define TCG_TARGET_HAS_v256             0
diff --git a/tcg/arm/tcg-target.h b/tcg/arm/tcg-target.h
index 4c2d3332d5..65efc538f4 100644
--- a/tcg/arm/tcg-target.h
+++ b/tcg/arm/tcg-target.h
@@ -125,6 +125,8 @@ extern bool use_neon_instructions;
 #define TCG_TARGET_HAS_rem_i32          0
 #define TCG_TARGET_HAS_qemu_st8_i32     0
 
+#define TCG_TARGET_HAS_qemu_ldst_i128   0
+
 #define TCG_TARGET_HAS_v64              use_neon_instructions
 #define TCG_TARGET_HAS_v128             use_neon_instructions
 #define TCG_TARGET_HAS_v256             0
diff --git a/tcg/i386/tcg-target.h b/tcg/i386/tcg-target.h
index 8fe6958abd..943af6775e 100644
--- a/tcg/i386/tcg-target.h
+++ b/tcg/i386/tcg-target.h
@@ -194,6 +194,8 @@ extern bool have_atomic16;
 #define TCG_TARGET_HAS_qemu_st8_i32     1
 #endif
 
+#define TCG_TARGET_HAS_qemu_ldst_i128   0
+
 /* We do not support older SSE systems, only beginning with AVX1.  */
 #define TCG_TARGET_HAS_v64              have_avx1
 #define TCG_TARGET_HAS_v128             have_avx1
diff --git a/tcg/loongarch64/tcg-target.h b/tcg/loongarch64/tcg-target.h
index 75c3d80ed2..482901ac15 100644
--- a/tcg/loongarch64/tcg-target.h
+++ b/tcg/loongarch64/tcg-target.h
@@ -168,6 +168,7 @@ typedef enum {
 #define TCG_TARGET_HAS_muls2_i64        0
 #define TCG_TARGET_HAS_muluh_i64        1
 #define TCG_TARGET_HAS_mulsh_i64        1
+#define TCG_TARGET_HAS_qemu_ldst_i128   0
 
 #define TCG_TARGET_DEFAULT_MO (0)
 
diff --git a/tcg/mips/tcg-target.h b/tcg/mips/tcg-target.h
index 47088af9cb..7277a117ef 100644
--- a/tcg/mips/tcg-target.h
+++ b/tcg/mips/tcg-target.h
@@ -204,6 +204,8 @@ extern bool use_mips32r2_instructions;
 #define TCG_TARGET_HAS_ext16u_i64       0 /* andi rt, rs, 0xffff */
 #endif
 
+#define TCG_TARGET_HAS_qemu_ldst_i128   0
+
 #define TCG_TARGET_DEFAULT_MO           0
 #define TCG_TARGET_NEED_LDST_LABELS
 
diff --git a/tcg/ppc/tcg-target.h b/tcg/ppc/tcg-target.h
index d55f0266bb..0914380bd7 100644
--- a/tcg/ppc/tcg-target.h
+++ b/tcg/ppc/tcg-target.h
@@ -149,6 +149,8 @@ extern bool have_vsx;
 #define TCG_TARGET_HAS_mulsh_i64        1
 #endif
 
+#define TCG_TARGET_HAS_qemu_ldst_i128   0
+
 /*
  * While technically Altivec could support V64, it has no 64-bit store
  * instruction and substituting two 32-bit stores makes the generated
diff --git a/tcg/riscv/tcg-target.h b/tcg/riscv/tcg-target.h
index dece3b3c27..494c986b49 100644
--- a/tcg/riscv/tcg-target.h
+++ b/tcg/riscv/tcg-target.h
@@ -163,6 +163,8 @@ typedef enum {
 #define TCG_TARGET_HAS_muluh_i64        1
 #define TCG_TARGET_HAS_mulsh_i64        1
 
+#define TCG_TARGET_HAS_qemu_ldst_i128   0
+
 #define TCG_TARGET_DEFAULT_MO (0)
 
 #define TCG_TARGET_NEED_LDST_LABELS
diff --git a/tcg/s390x/tcg-target.h b/tcg/s390x/tcg-target.h
index fe05680124..170007bea5 100644
--- a/tcg/s390x/tcg-target.h
+++ b/tcg/s390x/tcg-target.h
@@ -140,6 +140,8 @@ extern uint64_t s390_facilities[3];
 #define TCG_TARGET_HAS_muluh_i64      0
 #define TCG_TARGET_HAS_mulsh_i64      0
 
+#define TCG_TARGET_HAS_qemu_ldst_i128 0
+
 #define TCG_TARGET_HAS_v64            HAVE_FACILITY(VECTOR)
 #define TCG_TARGET_HAS_v128           HAVE_FACILITY(VECTOR)
 #define TCG_TARGET_HAS_v256           0
diff --git a/tcg/sparc64/tcg-target.h b/tcg/sparc64/tcg-target.h
index f6cd86975a..31c5537379 100644
--- a/tcg/sparc64/tcg-target.h
+++ b/tcg/sparc64/tcg-target.h
@@ -151,6 +151,8 @@ extern bool use_vis3_instructions;
 #define TCG_TARGET_HAS_muluh_i64        use_vis3_instructions
 #define TCG_TARGET_HAS_mulsh_i64        0
 
+#define TCG_TARGET_HAS_qemu_ldst_i128   0
+
 #define TCG_AREG0 TCG_REG_I0
 
 #define TCG_TARGET_DEFAULT_MO (0)
diff --git a/tcg/tci/tcg-target.h b/tcg/tci/tcg-target.h
index 364012e4d2..28dc6d5cfc 100644
--- a/tcg/tci/tcg-target.h
+++ b/tcg/tci/tcg-target.h
@@ -127,6 +127,8 @@
 #define TCG_TARGET_HAS_mulu2_i32        1
 #endif /* TCG_TARGET_REG_BITS == 64 */
 
+#define TCG_TARGET_HAS_qemu_ldst_i128   0
+
 /* Number of registers available. */
 #define TCG_TARGET_NB_REGS 16
 
diff --git a/tcg/tcg-op.c b/tcg/tcg-op.c
index 85f22458c9..06d3181fd0 100644
--- a/tcg/tcg-op.c
+++ b/tcg/tcg-op.c
@@ -3216,7 +3216,7 @@ static void canonicalize_memop_i128_as_i64(MemOp ret[2], MemOp orig)
 
 void tcg_gen_qemu_ld_i128(TCGv_i128 val, TCGv addr, TCGArg idx, MemOp memop)
 {
-    MemOpIdx oi = make_memop_idx(memop, idx);
+    const MemOpIdx oi = make_memop_idx(memop, idx);
 
     tcg_debug_assert((memop & MO_SIZE) == MO_128);
     tcg_debug_assert((memop & MO_SIGN) == 0);
@@ -3224,9 +3224,36 @@ void tcg_gen_qemu_ld_i128(TCGv_i128 val, TCGv addr, TCGArg idx, MemOp memop)
     tcg_gen_req_mo(TCG_MO_LD_LD | TCG_MO_ST_LD);
     addr = plugin_prep_mem_callbacks(addr);
 
-    /* TODO: allow the tcg backend to see the whole operation. */
+    /* TODO: For now, force 32-bit hosts to use the helper. */
+    if (TCG_TARGET_HAS_qemu_ldst_i128 && TCG_TARGET_REG_BITS == 64) {
+        TCGv_i64 lo, hi;
+        TCGArg addr_arg;
+        MemOpIdx adj_oi;
+        bool need_bswap = false;
 
-    if (use_two_i64_for_i128(memop)) {
+        if ((memop & MO_BSWAP) && !tcg_target_has_memory_bswap(memop)) {
+            lo = TCGV128_HIGH(val);
+            hi = TCGV128_LOW(val);
+            adj_oi = make_memop_idx(memop & ~MO_BSWAP, idx);
+            need_bswap = true;
+        } else {
+            lo = TCGV128_LOW(val);
+            hi = TCGV128_HIGH(val);
+            adj_oi = oi;
+        }
+
+#if TARGET_LONG_BITS == 32
+        addr_arg = tcgv_i32_arg(addr);
+#else
+        addr_arg = tcgv_i64_arg(addr);
+#endif
+        tcg_gen_op4ii_i64(INDEX_op_qemu_ld_i128, lo, hi, addr_arg, adj_oi);
+
+        if (need_bswap) {
+            tcg_gen_bswap64_i64(lo, lo);
+            tcg_gen_bswap64_i64(hi, hi);
+        }
+    } else if (use_two_i64_for_i128(memop)) {
         MemOp mop[2];
         TCGv addr_p8;
         TCGv_i64 x, y;
@@ -3269,7 +3296,7 @@ void tcg_gen_qemu_ld_i128(TCGv_i128 val, TCGv addr, TCGArg idx, MemOp memop)
 
 void tcg_gen_qemu_st_i128(TCGv_i128 val, TCGv addr, TCGArg idx, MemOp memop)
 {
-    MemOpIdx oi = make_memop_idx(memop, idx);
+    const MemOpIdx oi = make_memop_idx(memop, idx);
 
     tcg_debug_assert((memop & MO_SIZE) == MO_128);
     tcg_debug_assert((memop & MO_SIGN) == 0);
@@ -3277,9 +3304,39 @@ void tcg_gen_qemu_st_i128(TCGv_i128 val, TCGv addr, TCGArg idx, MemOp memop)
     tcg_gen_req_mo(TCG_MO_ST_LD | TCG_MO_ST_ST);
     addr = plugin_prep_mem_callbacks(addr);
 
-    /* TODO: allow the tcg backend to see the whole operation. */
+    /* TODO: For now, force 32-bit hosts to use the helper. */
 
-    if (use_two_i64_for_i128(memop)) {
+    if (TCG_TARGET_HAS_qemu_ldst_i128 && TCG_TARGET_REG_BITS == 64) {
+        TCGv_i64 lo, hi;
+        TCGArg addr_arg;
+        MemOpIdx adj_oi;
+        bool need_bswap = false;
+
+        if ((memop & MO_BSWAP) && !tcg_target_has_memory_bswap(memop)) {
+            lo = tcg_temp_new_i64();
+            hi = tcg_temp_new_i64();
+            tcg_gen_bswap64_i64(lo, TCGV128_HIGH(val));
+            tcg_gen_bswap64_i64(hi, TCGV128_LOW(val));
+            adj_oi = make_memop_idx(memop & ~MO_BSWAP, idx);
+            need_bswap = true;
+        } else {
+            lo = TCGV128_LOW(val);
+            hi = TCGV128_HIGH(val);
+            adj_oi = oi;
+        }
+
+#if TARGET_LONG_BITS == 32
+        addr_arg = tcgv_i32_arg(addr);
+#else
+        addr_arg = tcgv_i64_arg(addr);
+#endif
+        tcg_gen_op4ii_i64(INDEX_op_qemu_st_i128, lo, hi, addr_arg, adj_oi);
+
+        if (need_bswap) {
+            tcg_temp_free_i64(lo);
+            tcg_temp_free_i64(hi);
+        }
+    } else if (use_two_i64_for_i128(memop)) {
         MemOp mop[2];
         TCGv addr_p8;
         TCGv_i64 x, y;
diff --git a/tcg/tcg.c b/tcg/tcg.c
index cb5ca9b612..b0e30a55ca 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -1718,6 +1718,10 @@ bool tcg_op_supported(TCGOpcode op)
     case INDEX_op_qemu_st8_i32:
         return TCG_TARGET_HAS_qemu_st8_i32;
 
+    case INDEX_op_qemu_ld_i128:
+    case INDEX_op_qemu_st_i128:
+        return TCG_TARGET_HAS_qemu_ldst_i128;
+
     case INDEX_op_mov_i32:
     case INDEX_op_setcond_i32:
     case INDEX_op_brcond_i32:
diff --git a/docs/devel/tcg-ops.rst b/docs/devel/tcg-ops.rst
index f3f451b77f..6a166c5665 100644
--- a/docs/devel/tcg-ops.rst
+++ b/docs/devel/tcg-ops.rst
@@ -672,19 +672,20 @@ QEMU specific operations
        | This operation is optional. If the TCG backend does not implement the
          goto_ptr opcode, emitting this op is equivalent to emitting exit_tb(0).
 
-   * - qemu_ld_i32/i64 *t0*, *t1*, *flags*, *memidx*
+   * - qemu_ld_i32/i64/i128 *t0*, *t1*, *flags*, *memidx*
 
-       qemu_st_i32/i64 *t0*, *t1*, *flags*, *memidx*
+       qemu_st_i32/i64/i128 *t0*, *t1*, *flags*, *memidx*
 
        qemu_st8_i32 *t0*, *t1*, *flags*, *memidx*
 
      - | Load data at the guest address *t1* into *t0*, or store data in *t0* at guest
-         address *t1*.  The _i32/_i64 size applies to the size of the input/output
+         address *t1*.  The _i32/_i64/_i128 size applies to the size of the input/output
          register *t0* only.  The address *t1* is always sized according to the guest,
          and the width of the memory operation is controlled by *flags*.
        |
        | Both *t0* and *t1* may be split into little-endian ordered pairs of registers
-         if dealing with 64-bit quantities on a 32-bit host.
+         if dealing with 64-bit quantities on a 32-bit host, or 128-bit quantities on
+         a 64-bit host.
        |
        | The *memidx* selects the qemu tlb index to use (e.g. user or kernel access).
          The flags are the MemOp bits, selecting the sign, width, and endianness
@@ -693,6 +694,8 @@ QEMU specific operations
        | For a 32-bit host, qemu_ld/st_i64 is guaranteed to only be used with a
          64-bit memory access specified in *flags*.
        |
+       | For qemu_ld/st_i128, these are only supported for a 64-bit host.
+       |
        | For i386, qemu_st8_i32 is exactly like qemu_st_i32, except the size of
          the memory operation is known to be 8-bit.  This allows the backend to
          provide a different set of register constraints.
-- 
2.34.1
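
A hypothetical front-end fragment (not from this series; assumes the
usual "tcg/tcg-op.h" declarations and a translator-supplied mem_idx)
showing what eventually reaches the new opcodes:

    static void gen_load_quad(TCGv addr, int mem_idx)
    {
        TCGv_i128 val = tcg_temp_new_i128();

        /* One guest access; MO_128 selects the 16-byte width. */
        tcg_gen_qemu_ld_i128(val, addr, mem_idx, MO_LE | MO_128 | MO_ALIGN);
        /* ... consume val, e.g. via tcg_gen_extr_i128_i64() ... */
    }

On a 64-bit host that sets TCG_TARGET_HAS_qemu_ldst_i128, the tcg-op.c
changes above lower this to a single qemu_ld_i128 op whose two outputs
are TCGV128_LOW(val) and TCGV128_HIGH(val); otherwise the existing
two-i64 or helper paths are used unchanged.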



^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH v4 41/57] tcg: Support TCG_TYPE_I128 in tcg_out_{ld, st}_helper_{args, ret}
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
                   ` (39 preceding siblings ...)
  2023-05-03  7:06 ` [PATCH v4 40/57] tcg: Add INDEX_op_qemu_{ld,st}_i128 Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-05 12:53   ` Peter Maydell
  2023-05-03  7:06 ` [PATCH v4 42/57] tcg: Introduce atom_and_align_for_opc Richard Henderson
                   ` (16 subsequent siblings)
  57 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel; +Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/tcg.c | 174 ++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 148 insertions(+), 26 deletions(-)

diff --git a/tcg/tcg.c b/tcg/tcg.c
index b0e30a55ca..3905d3041c 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -206,6 +206,7 @@ static void * const qemu_ld_helpers[MO_SSIZE + 1] __attribute__((unused)) = {
     [MO_UQ] = helper_ldq_mmu,
 #if TCG_TARGET_REG_BITS == 64
     [MO_SL] = helper_ldsl_mmu,
+    [MO_128] = helper_ld16_mmu,
 #endif
 };
 
@@ -214,6 +215,9 @@ static void * const qemu_st_helpers[MO_SIZE + 1] __attribute__((unused)) = {
     [MO_16] = helper_stw_mmu,
     [MO_32] = helper_stl_mmu,
     [MO_64] = helper_stq_mmu,
+#if TCG_TARGET_REG_BITS == 64
+    [MO_128] = helper_st16_mmu,
+#endif
 };
 
 TCGContext tcg_init_ctx;
@@ -773,6 +777,15 @@ static TCGHelperInfo info_helper_ld64_mmu = {
               | dh_typemask(ptr, 4)  /* uintptr_t ra */
 };
 
+static TCGHelperInfo info_helper_ld128_mmu = {
+    .flags = TCG_CALL_NO_WG,
+    .typemask = dh_typemask(i128, 0) /* return Int128 */
+              | dh_typemask(env, 1)
+              | dh_typemask(tl, 2)   /* target_ulong addr */
+              | dh_typemask(i32, 3)  /* unsigned oi */
+              | dh_typemask(ptr, 4)  /* uintptr_t ra */
+};
+
 static TCGHelperInfo info_helper_st32_mmu = {
     .flags = TCG_CALL_NO_WG,
     .typemask = dh_typemask(void, 0)
@@ -793,6 +806,16 @@ static TCGHelperInfo info_helper_st64_mmu = {
               | dh_typemask(ptr, 5)  /* uintptr_t ra */
 };
 
+static TCGHelperInfo info_helper_st128_mmu = {
+    .flags = TCG_CALL_NO_WG,
+    .typemask = dh_typemask(void, 0)
+              | dh_typemask(env, 1)
+              | dh_typemask(tl, 2)   /* target_ulong addr */
+              | dh_typemask(i128, 3) /* Int128 data */
+              | dh_typemask(i32, 4)  /* unsigned oi */
+              | dh_typemask(ptr, 5)  /* uintptr_t ra */
+};
+
 #ifdef CONFIG_TCG_INTERPRETER
 static ffi_type *typecode_to_ffi(int argmask)
 {
@@ -1206,8 +1229,10 @@ static void tcg_context_init(unsigned max_cpus)
 
     init_call_layout(&info_helper_ld32_mmu);
     init_call_layout(&info_helper_ld64_mmu);
+    init_call_layout(&info_helper_ld128_mmu);
     init_call_layout(&info_helper_st32_mmu);
     init_call_layout(&info_helper_st64_mmu);
+    init_call_layout(&info_helper_st128_mmu);
 
 #ifdef CONFIG_TCG_INTERPRETER
     init_ffi_layouts();
@@ -5361,6 +5386,9 @@ static void tcg_out_ld_helper_args(TCGContext *s, const TCGLabelQemuLdst *ldst,
     case MO_64:
         info = &info_helper_ld64_mmu;
         break;
+    case MO_128:
+        info = &info_helper_ld128_mmu;
+        break;
     default:
         g_assert_not_reached();
     }
@@ -5375,8 +5403,33 @@ static void tcg_out_ld_helper_args(TCGContext *s, const TCGLabelQemuLdst *ldst,
 
     tcg_out_helper_load_slots(s, nmov, mov, parm);
 
-    /* No special attention for 32 and 64-bit return values. */
-    tcg_debug_assert(info->out_kind == TCG_CALL_RET_NORMAL);
+    switch (info->out_kind) {
+    case TCG_CALL_RET_NORMAL:
+    case TCG_CALL_RET_BY_VEC:
+        break;
+    case TCG_CALL_RET_BY_REF:
+        /*
+         * The return reference is in the first argument slot.
+         * We need memory in which to return: re-use the top of stack.
+         */
+        {
+            int ofs_slot0 = arg_slot_stk_ofs(0);
+
+            if (arg_slot_reg_p(0)) {
+                tcg_out_addi_ptr(s, tcg_target_call_iarg_regs[0],
+                                 TCG_REG_CALL_STACK, ofs_slot0);
+            } else {
+                tcg_debug_assert(parm->ntmp != 0);
+                tcg_out_addi_ptr(s, parm->tmp[0],
+                                 TCG_REG_CALL_STACK, ofs_slot0);
+                tcg_out_st(s, TCG_TYPE_PTR, parm->tmp[0],
+                           TCG_REG_CALL_STACK, ofs_slot0);
+            }
+        }
+        break;
+    default:
+        g_assert_not_reached();
+    }
 
     tcg_out_helper_load_common_args(s, ldst, parm, info, next_arg);
 }
@@ -5385,11 +5438,18 @@ static void tcg_out_ld_helper_ret(TCGContext *s, const TCGLabelQemuLdst *ldst,
                                   bool load_sign,
                                   const TCGLdstHelperParam *parm)
 {
+    MemOp mop = get_memop(ldst->oi);
     TCGMovExtend mov[2];
+    int ofs_slot0;
 
-    if (ldst->type <= TCG_TYPE_REG) {
-        MemOp mop = get_memop(ldst->oi);
+    switch (ldst->type) {
+    case TCG_TYPE_I64:
+        if (TCG_TARGET_REG_BITS == 32) {
+            break;
+        }
+        /* fall through */
 
+    case TCG_TYPE_I32:
         mov[0].dst = ldst->datalo_reg;
         mov[0].src = tcg_target_call_oarg_reg(TCG_CALL_RET_NORMAL, 0);
         mov[0].dst_type = ldst->type;
@@ -5415,25 +5475,49 @@ static void tcg_out_ld_helper_ret(TCGContext *s, const TCGLabelQemuLdst *ldst,
             mov[0].src_ext = mop & MO_SSIZE;
         }
         tcg_out_movext1(s, mov);
-    } else {
-        assert(TCG_TARGET_REG_BITS == 32);
+        return;
 
-        mov[0].dst = ldst->datalo_reg;
-        mov[0].src =
-            tcg_target_call_oarg_reg(TCG_CALL_RET_NORMAL, HOST_BIG_ENDIAN);
-        mov[0].dst_type = TCG_TYPE_I32;
-        mov[0].src_type = TCG_TYPE_I32;
-        mov[0].src_ext = MO_32;
+    case TCG_TYPE_I128:
+        tcg_debug_assert(TCG_TARGET_REG_BITS == 64);
+        ofs_slot0 = arg_slot_stk_ofs(0);
+        switch (TCG_TARGET_CALL_RET_I128) {
+        case TCG_CALL_RET_NORMAL:
+            break;
+        case TCG_CALL_RET_BY_VEC:
+            tcg_out_st(s, TCG_TYPE_V128,
+                       tcg_target_call_oarg_reg(TCG_CALL_RET_BY_VEC, 0),
+                       TCG_REG_CALL_STACK, ofs_slot0);
+            /* fall through */
+        case TCG_CALL_RET_BY_REF:
+            tcg_out_ld(s, TCG_TYPE_I64, ldst->datalo_reg,
+                       TCG_REG_CALL_STACK, ofs_slot0 + 8 * HOST_BIG_ENDIAN);
+            tcg_out_ld(s, TCG_TYPE_I64, ldst->datahi_reg,
+                       TCG_REG_CALL_STACK, ofs_slot0 + 8 * !HOST_BIG_ENDIAN);
+            return;
+        default:
+            g_assert_not_reached();
+        }
+        break;
 
-        mov[1].dst = ldst->datahi_reg;
-        mov[1].src =
-            tcg_target_call_oarg_reg(TCG_CALL_RET_NORMAL, !HOST_BIG_ENDIAN);
-        mov[1].dst_type = TCG_TYPE_REG;
-        mov[1].src_type = TCG_TYPE_REG;
-        mov[1].src_ext = MO_32;
-
-        tcg_out_movext2(s, mov, mov + 1, parm->ntmp ? parm->tmp[0] : -1);
+    default:
+        g_assert_not_reached();
     }
+
+    mov[0].dst = ldst->datalo_reg;
+    mov[0].src =
+        tcg_target_call_oarg_reg(TCG_CALL_RET_NORMAL, HOST_BIG_ENDIAN);
+    mov[0].dst_type = TCG_TYPE_I32;
+    mov[0].src_type = TCG_TYPE_I32;
+    mov[0].src_ext = TCG_TARGET_REG_BITS == 32 ? MO_32 : MO_64;
+
+    mov[1].dst = ldst->datahi_reg;
+    mov[1].src =
+        tcg_target_call_oarg_reg(TCG_CALL_RET_NORMAL, !HOST_BIG_ENDIAN);
+    mov[1].dst_type = TCG_TYPE_REG;
+    mov[1].src_type = TCG_TYPE_REG;
+    mov[1].src_ext = TCG_TARGET_REG_BITS == 32 ? MO_32 : MO_64;
+
+    tcg_out_movext2(s, mov, mov + 1, parm->ntmp ? parm->tmp[0] : -1);
 }
 
 static void tcg_out_st_helper_args(TCGContext *s, const TCGLabelQemuLdst *ldst,
@@ -5457,6 +5541,10 @@ static void tcg_out_st_helper_args(TCGContext *s, const TCGLabelQemuLdst *ldst,
         info = &info_helper_st64_mmu;
         data_type = TCG_TYPE_I64;
         break;
+    case MO_128:
+        info = &info_helper_st128_mmu;
+        data_type = TCG_TYPE_I128;
+        break;
     default:
         g_assert_not_reached();
     }
@@ -5474,13 +5562,47 @@ static void tcg_out_st_helper_args(TCGContext *s, const TCGLabelQemuLdst *ldst,
 
     /* Handle data argument. */
     loc = &info->in[next_arg];
-    n = tcg_out_helper_add_mov(mov + nmov, loc, data_type, ldst->type,
-                               ldst->datalo_reg, ldst->datahi_reg);
-    next_arg += n;
-    nmov += n;
-    tcg_debug_assert(nmov <= ARRAY_SIZE(mov));
+    switch (loc->kind) {
+    case TCG_CALL_ARG_NORMAL:
+    case TCG_CALL_ARG_EXTEND_U:
+    case TCG_CALL_ARG_EXTEND_S:
+        n = tcg_out_helper_add_mov(mov + nmov, loc, data_type, ldst->type,
+                                   ldst->datalo_reg, ldst->datahi_reg);
+        next_arg += n;
+        nmov += n;
+        tcg_out_helper_load_slots(s, nmov, mov, parm);
+        break;
+
+    case TCG_CALL_ARG_BY_REF:
+        tcg_debug_assert(TCG_TARGET_REG_BITS == 64);
+        tcg_debug_assert(data_type == TCG_TYPE_I128);
+        tcg_out_st(s, TCG_TYPE_I64,
+                   HOST_BIG_ENDIAN ? ldst->datahi_reg : ldst->datalo_reg,
+                   TCG_REG_CALL_STACK, arg_slot_stk_ofs(loc[0].ref_slot));
+        tcg_out_st(s, TCG_TYPE_I64,
+                   HOST_BIG_ENDIAN ? ldst->datalo_reg : ldst->datahi_reg,
+                   TCG_REG_CALL_STACK, arg_slot_stk_ofs(loc[1].ref_slot));
+
+        tcg_out_helper_load_slots(s, nmov, mov, parm);
+
+        if (arg_slot_reg_p(loc->arg_slot)) {
+            tcg_out_addi_ptr(s, tcg_target_call_iarg_regs[loc->arg_slot],
+                             TCG_REG_CALL_STACK,
+                             arg_slot_stk_ofs(loc->ref_slot));
+        } else {
+            tcg_debug_assert(parm->ntmp != 0);
+            tcg_out_addi_ptr(s, parm->tmp[0], TCG_REG_CALL_STACK,
+                             arg_slot_stk_ofs(loc->ref_slot));
+            tcg_out_st(s, TCG_TYPE_PTR, parm->tmp[0],
+                       TCG_REG_CALL_STACK, arg_slot_stk_ofs(loc->arg_slot));
+        }
+        next_arg += 2;
+        break;
+
+    default:
+        g_assert_not_reached();
+    }
 
-    tcg_out_helper_load_slots(s, nmov, mov, parm);
     tcg_out_helper_load_common_args(s, ldst, parm, info, next_arg);
 }
 
-- 
2.34.1
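
The TCG_CALL_RET_BY_REF handling mirrors what a C compiler does for a
16-byte return value on an ABI without 128-bit return registers: the
callee receives a hidden pointer to the result as its first argument.
A plain-C sketch of that convention (illustrative only, not QEMU code):

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Stand-in for Int128 when it cannot be returned in registers. */
    typedef struct { uint64_t lo, hi; } I128;

    /* "By reference" form: result pointer passed as the first argument. */
    static void helper_ld16_byref(I128 *ret, uint64_t addr)
    {
        ret->lo = addr;             /* dummy data for the sketch */
        ret->hi = ~addr;
    }

    int main(void)
    {
        I128 slot;                  /* the patch re-uses the top of stack */

        helper_ld16_byref(&slot, UINT64_C(0x1122334455667788));
        printf("hi=%016" PRIx64 " lo=%016" PRIx64 "\n", slot.hi, slot.lo);
        return 0;
    }

tcg_out_ld_helper_args() above passes the address of stack slot 0 as
that hidden first argument, and tcg_out_ld_helper_ret() reloads
datalo/datahi from the slot after the call.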



^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH v4 42/57] tcg: Introduce atom_and_align_for_opc
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
                   ` (40 preceding siblings ...)
  2023-05-03  7:06 ` [PATCH v4 41/57] tcg: Support TCG_TYPE_I128 in tcg_out_{ld, st}_helper_{args, ret} Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-05 13:03   ` Peter Maydell
  2023-05-03  7:06 ` [PATCH v4 43/57] tcg/i386: Use atom_and_align_for_opc Richard Henderson
                   ` (15 subsequent siblings)
  57 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel; +Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

Examine MemOp for atomicity and alignment, adjusting alignment
as required to implement atomicity on the host.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/tcg.c | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 69 insertions(+)

diff --git a/tcg/tcg.c b/tcg/tcg.c
index 3905d3041c..2422da64ac 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -220,6 +220,11 @@ static void * const qemu_st_helpers[MO_SIZE + 1] __attribute__((unused)) = {
 #endif
 };
 
+static MemOp atom_and_align_for_opc(TCGContext *s, MemOp *p_atom_a,
+                                    MemOp *p_atom_u, MemOp opc,
+                                    MemOp host_atom, bool allow_two_ops)
+    __attribute__((unused));
+
 TCGContext tcg_init_ctx;
 __thread TCGContext *tcg_ctx;
 
@@ -5123,6 +5128,70 @@ static void tcg_reg_alloc_call(TCGContext *s, TCGOp *op)
     }
 }
 
+/*
+ * Return the alignment and atomicity to use for the inline fast path
+ * for the given memory operation.  The alignment may be larger than
+ * that specified in @opc, and the correct alignment will be diagnosed
+ * by the slow path helper.
+ */
+static MemOp atom_and_align_for_opc(TCGContext *s, MemOp *p_atom_a,
+                                    MemOp *p_atom_u, MemOp opc,
+                                    MemOp host_atom, bool allow_two_ops)
+{
+    MemOp align = get_alignment_bits(opc);
+    MemOp atom, atmax, atmin, size = opc & MO_SIZE;
+
+    /* When serialized, no further atomicity required.  */
+    if (s->gen_tb->cflags & CF_PARALLEL) {
+        atom = opc & MO_ATOM_MASK;
+    } else {
+        atom = MO_ATOM_NONE;
+    }
+
+    atmax = opc & MO_ATMAX_MASK;
+    if (atmax == MO_ATMAX_SIZE) {
+        atmax = size;
+    } else {
+        atmax = atmax >> MO_ATMAX_SHIFT;
+    }
+
+    switch (atom) {
+    case MO_ATOM_NONE:
+        /* The operation requires no specific atomicity. */
+        atmax = atmin = MO_8;
+        break;
+    case MO_ATOM_IFALIGN:
+        /* If unaligned, the subobjects are bytes. */
+        atmin = MO_8;
+        break;
+    case MO_ATOM_WITHIN16:
+        /* If unaligned, there are subobjects if atmax < size. */
+        atmin = (atmax < size ? atmax : MO_8);
+        atmax = size;
+        break;
+    case MO_ATOM_SUBALIGN:
+        /* If unaligned but not odd, there are subobjects up to atmax - 1. */
+        atmin = (atmax == MO_8 ? MO_8 : atmax - 1);
+        break;
+    default:
+        g_assert_not_reached();
+    }
+
+    /*
+     * If there are subobjects, and the host model does not match, then we
+     * need to raise the initial alignment check.  If the backend is prepared
+     * to double-check alignment and issue two half size ops, we need not
+     * raise initial alignment beyond half.
+     */
+    if (atmin > MO_8 && host_atom != atom) {
+        align = MAX(align, size - allow_two_ops);
+    }
+
+    *p_atom_a = atmax;
+    *p_atom_u = atmin;
+    return align;
+}
+
 /*
  * Similarly for qemu_ld/st slow path helpers.
  * We must re-implement tcg_gen_callN and tcg_reg_alloc_call simultaneously,
-- 
2.34.1
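
A worked example of the adjustment at the end of the function, with the
constants inlined so it needs no QEMU headers (an 8-byte access with
MO_ATOM_SUBALIGN guest semantics, compiled for a host that only
guarantees MO_ATOM_IFALIGN, with allow_two_ops false):

    #include <stdio.h>

    int main(void)
    {
        int size  = 3;          /* MO_64: log2(8) bytes */
        int align = 0;          /* guest required no alignment */
        int atmax = size;       /* MO_ATMAX_SIZE */
        int atmin = atmax - 1;  /* MO_ATOM_SUBALIGN: subobjects to atmax-1 */
        int host_matches  = 0;  /* host model is MO_ATOM_IFALIGN */
        int allow_two_ops = 0;

        if (atmin > 0 && !host_matches) {       /* MO_8 == 0 */
            int want = size - allow_two_ops;
            align = align > want ? align : want;
        }
        /* Prints "align = 3": the inline fast path now requires 8-byte
           alignment, and anything less aligned is diagnosed by the slow
           path helper. */
        printf("align = %d\n", align);
        return 0;
    }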



^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH v4 43/57] tcg/i386: Use atom_and_align_for_opc
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
                   ` (41 preceding siblings ...)
  2023-05-03  7:06 ` [PATCH v4 42/57] tcg: Introduce atom_and_align_for_opc Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-05 13:14   ` Peter Maydell
  2023-05-03  7:06 ` [PATCH v4 44/57] tcg/aarch64: " Richard Henderson
                   ` (14 subsequent siblings)
  57 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel; +Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

No change to the ultimate load/store routines yet, so some atomicity
conditions are not yet honored, but this plumbs the alignment change
through the relevant functions.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/i386/tcg-target.c.inc | 34 ++++++++++++++++++++++------------
 1 file changed, 22 insertions(+), 12 deletions(-)

diff --git a/tcg/i386/tcg-target.c.inc b/tcg/i386/tcg-target.c.inc
index 7c72bf6684..3e21f067d6 100644
--- a/tcg/i386/tcg-target.c.inc
+++ b/tcg/i386/tcg-target.c.inc
@@ -1774,6 +1774,8 @@ typedef struct {
     int index;
     int ofs;
     int seg;
+    MemOp align;
+    MemOp atom;
 } HostAddress;
 
 bool tcg_target_has_memory_bswap(MemOp memop)
@@ -1895,8 +1897,12 @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
 {
     TCGLabelQemuLdst *ldst = NULL;
     MemOp opc = get_memop(oi);
-    unsigned a_bits = get_alignment_bits(opc);
-    unsigned a_mask = (1 << a_bits) - 1;
+    MemOp atom_u;
+    unsigned a_mask;
+
+    h->align = atom_and_align_for_opc(s, &h->atom, &atom_u, opc,
+                                      MO_ATOM_IFALIGN, false);
+    a_mask = (1 << h->align) - 1;
 
 #ifdef CONFIG_SOFTMMU
     int cmp_ofs = is_ld ? offsetof(CPUTLBEntry, addr_read)
@@ -1941,10 +1947,12 @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
                          TLB_MASK_TABLE_OFS(mem_index) +
                          offsetof(CPUTLBDescFast, table));
 
-    /* If the required alignment is at least as large as the access, simply
-       copy the address and mask.  For lesser alignments, check that we don't
-       cross pages for the complete access.  */
-    if (a_bits >= s_bits) {
+    /*
+     * If the required alignment is at least as large as the access, simply
+     * copy the address and mask.  For lesser alignments, check that we don't
+     * cross pages for the complete access.
+     */
+    if (a_mask >= s_mask) {
         tcg_out_mov(s, ttype, TCG_REG_L1, addrlo);
     } else {
         tcg_out_modrm_offset(s, OPC_LEA + trexw, TCG_REG_L1,
@@ -1976,12 +1984,12 @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
     tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_L0, TCG_REG_L0,
                offsetof(CPUTLBEntry, addend));
 
-    *h = (HostAddress) {
-        .base = addrlo,
-        .index = TCG_REG_L0,
-    };
+    h->base = addrlo;
+    h->index = TCG_REG_L0;
+    h->ofs = 0;
+    h->seg = 0;
 #else
-    if (a_bits) {
+    if (a_mask) {
         ldst = new_ldst_label(s);
 
         ldst->is_ld = is_ld;
@@ -1996,8 +2004,10 @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
         s->code_ptr += 4;
     }
 
-    *h = x86_guest_base;
     h->base = addrlo;
+    h->index = x86_guest_base.index;
+    h->ofs = x86_guest_base.ofs;
+    h->seg = x86_guest_base.seg;
 #endif
 
     return ldst;
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH v4 44/57] tcg/aarch64: Use atom_and_align_for_opc
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
                   ` (42 preceding siblings ...)
  2023-05-03  7:06 ` [PATCH v4 43/57] tcg/i386: Use atom_and_align_for_opc Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-05 13:15   ` Peter Maydell
  2023-05-03  7:06 ` [PATCH v4 45/57] tcg/arm: " Richard Henderson
                   ` (13 subsequent siblings)
  57 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel; +Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/aarch64/tcg-target.c.inc | 38 +++++++++++++++++++-----------------
 1 file changed, 20 insertions(+), 18 deletions(-)

diff --git a/tcg/aarch64/tcg-target.c.inc b/tcg/aarch64/tcg-target.c.inc
index 8e5f3d3688..1d6d382edd 100644
--- a/tcg/aarch64/tcg-target.c.inc
+++ b/tcg/aarch64/tcg-target.c.inc
@@ -1593,6 +1593,8 @@ typedef struct {
     TCGReg base;
     TCGReg index;
     TCGType index_ext;
+    MemOp align;
+    MemOp atom;
 } HostAddress;
 
 bool tcg_target_has_memory_bswap(MemOp memop)
@@ -1646,8 +1648,14 @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
     TCGType addr_type = TARGET_LONG_BITS == 64 ? TCG_TYPE_I64 : TCG_TYPE_I32;
     TCGLabelQemuLdst *ldst = NULL;
     MemOp opc = get_memop(oi);
-    unsigned a_bits = get_alignment_bits(opc);
-    unsigned a_mask = (1u << a_bits) - 1;
+    MemOp atom_u;
+    unsigned a_mask;
+
+    h->align = atom_and_align_for_opc(s, &h->atom, &atom_u, opc,
+                                      have_lse2 ? MO_ATOM_WITHIN16
+                                                : MO_ATOM_IFALIGN,
+                                      false);
+    a_mask = (1 << h->align) - 1;
 
 #ifdef CONFIG_SOFTMMU
     unsigned s_bits = opc & MO_SIZE;
@@ -1693,7 +1701,7 @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
      * bits within the address.  For unaligned access, we check that we don't
      * cross pages using the address of the last byte of the access.
      */
-    if (a_bits >= s_bits) {
+    if (a_mask >= s_mask) {
         x3 = addr_reg;
     } else {
         tcg_out_insn(s, 3401, ADDI, TARGET_LONG_BITS == 64,
@@ -1713,11 +1721,9 @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
     ldst->label_ptr[0] = s->code_ptr;
     tcg_out_insn(s, 3202, B_C, TCG_COND_NE, 0);
 
-    *h = (HostAddress){
-        .base = TCG_REG_X1,
-        .index = addr_reg,
-        .index_ext = addr_type
-    };
+    h->base = TCG_REG_X1;
+    h->index = addr_reg;
+    h->index_ext = addr_type;
 #else
     if (a_mask) {
         ldst = new_ldst_label(s);
@@ -1735,17 +1741,13 @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
     }
 
     if (USE_GUEST_BASE) {
-        *h = (HostAddress){
-            .base = TCG_REG_GUEST_BASE,
-            .index = addr_reg,
-            .index_ext = addr_type
-        };
+        h->base = TCG_REG_GUEST_BASE;
+        h->index = addr_reg;
+        h->index_ext = addr_type;
     } else {
-        *h = (HostAddress){
-            .base = addr_reg,
-            .index = TCG_REG_XZR,
-            .index_ext = TCG_TYPE_I64
-        };
+        h->base = addr_reg;
+        h->index = TCG_REG_XZR;
+        h->index_ext = TCG_TYPE_I64;
     }
 #endif
 
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH v4 45/57] tcg/arm: Use atom_and_align_for_opc
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
                   ` (43 preceding siblings ...)
  2023-05-03  7:06 ` [PATCH v4 44/57] tcg/aarch64: " Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-05 13:15   ` Peter Maydell
  2023-05-03  7:06 ` [PATCH v4 46/57] tcg/loongarch64: " Richard Henderson
                   ` (12 subsequent siblings)
  57 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel; +Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

No change to the ultimate load/store routines yet, so some atomicity
conditions are not yet honored, but this plumbs the change to alignment
through the relevant functions.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/arm/tcg-target.c.inc | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/tcg/arm/tcg-target.c.inc b/tcg/arm/tcg-target.c.inc
index e5aed03247..edd995e04f 100644
--- a/tcg/arm/tcg-target.c.inc
+++ b/tcg/arm/tcg-target.c.inc
@@ -1323,6 +1323,8 @@ typedef struct {
     TCGReg base;
     int index;
     bool index_scratch;
+    MemOp align;
+    MemOp atom;
 } HostAddress;
 
 bool tcg_target_has_memory_bswap(MemOp memop)
@@ -1379,8 +1381,12 @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
 {
     TCGLabelQemuLdst *ldst = NULL;
     MemOp opc = get_memop(oi);
-    MemOp a_bits = get_alignment_bits(opc);
-    unsigned a_mask = (1 << a_bits) - 1;
+    MemOp a_bits, atom_a, atom_u;
+    unsigned a_mask;
+
+    a_bits = atom_and_align_for_opc(s, &atom_a, &atom_u, opc,
+                                    MO_ATOM_IFALIGN, false);
+    a_mask = (1 << a_bits) - 1;
 
 #ifdef CONFIG_SOFTMMU
     int mem_index = get_mmuidx(oi);
@@ -1498,6 +1504,9 @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
     };
 #endif
 
+    h->align = a_bits;
+    h->atom = atom_a;
+
     return ldst;
 }
 
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH v4 46/57] tcg/loongarch64: Use atom_and_align_for_opc
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
                   ` (44 preceding siblings ...)
  2023-05-03  7:06 ` [PATCH v4 45/57] tcg/arm: " Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-05 13:16   ` Peter Maydell
  2023-05-03  7:06 ` [PATCH v4 47/57] tcg/mips: " Richard Henderson
                   ` (11 subsequent siblings)
  57 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel; +Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/loongarch64/tcg-target.c.inc | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/tcg/loongarch64/tcg-target.c.inc b/tcg/loongarch64/tcg-target.c.inc
index 62bf823084..43341524f2 100644
--- a/tcg/loongarch64/tcg-target.c.inc
+++ b/tcg/loongarch64/tcg-target.c.inc
@@ -826,6 +826,8 @@ static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
 typedef struct {
     TCGReg base;
     TCGReg index;
+    MemOp align;
+    MemOp atom;
 } HostAddress;
 
 bool tcg_target_has_memory_bswap(MemOp memop)
@@ -845,7 +847,11 @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
 {
     TCGLabelQemuLdst *ldst = NULL;
     MemOp opc = get_memop(oi);
-    unsigned a_bits = get_alignment_bits(opc);
+    MemOp a_bits, atom_u;
+
+    a_bits = atom_and_align_for_opc(s, &h->atom, &atom_u, opc,
+                                    MO_ATOM_IFALIGN, false);
+    h->align = a_bits;
 
 #ifdef CONFIG_SOFTMMU
     unsigned s_bits = opc & MO_SIZE;
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH v4 47/57] tcg/mips: Use atom_and_align_for_opc
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
                   ` (45 preceding siblings ...)
  2023-05-03  7:06 ` [PATCH v4 46/57] tcg/loongarch64: " Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-05 13:17   ` Peter Maydell
  2023-05-03  7:06 ` [PATCH v4 48/57] tcg/ppc: " Richard Henderson
                   ` (10 subsequent siblings)
  57 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel; +Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/mips/tcg-target.c.inc | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/tcg/mips/tcg-target.c.inc b/tcg/mips/tcg-target.c.inc
index cd0254a0d7..43a8ffac17 100644
--- a/tcg/mips/tcg-target.c.inc
+++ b/tcg/mips/tcg-target.c.inc
@@ -1139,6 +1139,7 @@ static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
 typedef struct {
     TCGReg base;
     MemOp align;
+    MemOp atom;
 } HostAddress;
 
 bool tcg_target_has_memory_bswap(MemOp memop)
@@ -1158,11 +1159,16 @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
 {
     TCGLabelQemuLdst *ldst = NULL;
     MemOp opc = get_memop(oi);
-    unsigned a_bits = get_alignment_bits(opc);
+    MemOp a_bits, atom_u;
     unsigned s_bits = opc & MO_SIZE;
-    unsigned a_mask = (1 << a_bits) - 1;
+    unsigned a_mask;
     TCGReg base;
 
+    a_bits = atom_and_align_for_opc(s, &h->atom, &atom_u, opc,
+                                    MO_ATOM_IFALIGN, false);
+    h->align = a_bits;
+    a_mask = (1 << a_bits) - 1;
+
 #ifdef CONFIG_SOFTMMU
     unsigned s_mask = (1 << s_bits) - 1;
     int mem_index = get_mmuidx(oi);
@@ -1281,7 +1287,6 @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
 #endif
 
     h->base = base;
-    h->align = a_bits;
     return ldst;
 }
 
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH v4 48/57] tcg/ppc: Use atom_and_align_for_opc
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
                   ` (46 preceding siblings ...)
  2023-05-03  7:06 ` [PATCH v4 47/57] tcg/mips: " Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-05 13:18   ` Peter Maydell
  2023-05-03  7:06 ` [PATCH v4 49/57] tcg/riscv: " Richard Henderson
                   ` (9 subsequent siblings)
  57 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel; +Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/ppc/tcg-target.c.inc | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/tcg/ppc/tcg-target.c.inc b/tcg/ppc/tcg-target.c.inc
index f0a4118bbb..60375804cd 100644
--- a/tcg/ppc/tcg-target.c.inc
+++ b/tcg/ppc/tcg-target.c.inc
@@ -2034,7 +2034,22 @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
 {
     TCGLabelQemuLdst *ldst = NULL;
     MemOp opc = get_memop(oi);
-    unsigned a_bits = get_alignment_bits(opc);
+    MemOp a_bits, atom_a, atom_u;
+
+    /*
+     * Book II, Section 1.4, Single-Copy Atomicity, specifies:
+     *
+     * Before 3.0, "An access that is not atomic is performed as a set of
+     * smaller disjoint atomic accesses. In general, the number and alignment
+     * of these accesses are implementation-dependent."  Thus MO_ATOM_IFALIGN.
+     *
+     * As of 3.0, "the non-atomic access is performed as described in
+     * the corresponding list", which matches MO_ATOM_SUBALIGN.
+     */
+    a_bits = atom_and_align_for_opc(s, &atom_a, &atom_u, opc,
+                                    have_isa_3_00 ? MO_ATOM_SUBALIGN
+                                                  : MO_ATOM_IFALIGN,
+                                    false);
 
 #ifdef CONFIG_SOFTMMU
     int mem_index = get_mmuidx(oi);
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH v4 49/57] tcg/riscv: Use atom_and_align_for_opc
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
                   ` (47 preceding siblings ...)
  2023-05-03  7:06 ` [PATCH v4 48/57] tcg/ppc: " Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-05 13:19   ` Peter Maydell
  2023-05-03  7:06 ` [PATCH v4 50/57] tcg/s390x: " Richard Henderson
                   ` (8 subsequent siblings)
  57 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel; +Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/riscv/tcg-target.c.inc | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/tcg/riscv/tcg-target.c.inc b/tcg/riscv/tcg-target.c.inc
index 37870c89fc..4dd33c73e8 100644
--- a/tcg/riscv/tcg-target.c.inc
+++ b/tcg/riscv/tcg-target.c.inc
@@ -910,8 +910,12 @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, TCGReg *pbase,
 {
     TCGLabelQemuLdst *ldst = NULL;
     MemOp opc = get_memop(oi);
-    unsigned a_bits = get_alignment_bits(opc);
-    unsigned a_mask = (1u << a_bits) - 1;
+    MemOp a_bits, atom_a, atom_u;
+    unsigned a_mask;
+
+    a_bits = atom_and_align_for_opc(s, &atom_a, &atom_u, opc,
+                                    MO_ATOM_IFALIGN, false);
+    a_mask = (1u << a_bits) - 1;
 
 #ifdef CONFIG_SOFTMMU
     unsigned s_bits = opc & MO_SIZE;
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH v4 50/57] tcg/s390x: Use atom_and_align_for_opc
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
                   ` (48 preceding siblings ...)
  2023-05-03  7:06 ` [PATCH v4 49/57] tcg/riscv: " Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-05 13:20   ` Peter Maydell
  2023-05-03  7:06 ` [PATCH v4 51/57] tcg/sparc64: " Richard Henderson
                   ` (7 subsequent siblings)
  57 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel; +Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/s390x/tcg-target.c.inc | 14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/tcg/s390x/tcg-target.c.inc b/tcg/s390x/tcg-target.c.inc
index 22f0206b5a..ddd9860a6a 100644
--- a/tcg/s390x/tcg-target.c.inc
+++ b/tcg/s390x/tcg-target.c.inc
@@ -1572,6 +1572,8 @@ typedef struct {
     TCGReg base;
     TCGReg index;
     int disp;
+    MemOp align;
+    MemOp atom;
 } HostAddress;
 
 bool tcg_target_has_memory_bswap(MemOp memop)
@@ -1733,8 +1735,12 @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
 {
     TCGLabelQemuLdst *ldst = NULL;
     MemOp opc = get_memop(oi);
-    unsigned a_bits = get_alignment_bits(opc);
-    unsigned a_mask = (1u << a_bits) - 1;
+    MemOp atom_u;
+    unsigned a_mask;
+
+    h->align = atom_and_align_for_opc(s, &h->atom, &atom_u, opc,
+                                      MO_ATOM_IFALIGN, false);
+    a_mask = (1 << h->align) - 1;
 
 #ifdef CONFIG_SOFTMMU
     unsigned s_bits = opc & MO_SIZE;
@@ -1764,7 +1770,7 @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
      * bits within the address.  For unaligned access, we check that we don't
      * cross pages using the address of the last byte of the access.
      */
-    a_off = (a_bits >= s_bits ? 0 : s_mask - a_mask);
+    a_off = (a_mask >= s_mask ? 0 : s_mask - a_mask);
     tlb_mask = (uint64_t)TARGET_PAGE_MASK | a_mask;
     if (a_off == 0) {
         tgen_andi_risbg(s, TCG_REG_R0, addr_reg, tlb_mask);
@@ -1806,7 +1812,7 @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
         ldst->addrlo_reg = addr_reg;
 
         /* We are expecting a_bits to max out at 7, much lower than TMLL. */
-        tcg_debug_assert(a_bits < 16);
+        tcg_debug_assert(a_mask <= 0xffff);
         tcg_out_insn(s, RI, TMLL, addr_reg, a_mask);
 
         tcg_out16(s, RI_BRC | (7 << 4)); /* CC in {1,2,3} */
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH v4 51/57] tcg/sparc64: Use atom_and_align_for_opc
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
                   ` (49 preceding siblings ...)
  2023-05-03  7:06 ` [PATCH v4 50/57] tcg/s390x: " Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-05 13:20   ` Peter Maydell
  2023-05-03  7:06 ` [PATCH v4 52/57] tcg/i386: Honor 64-bit atomicity in 32-bit mode Richard Henderson
                   ` (6 subsequent siblings)
  57 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel; +Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/sparc64/tcg-target.c.inc | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/tcg/sparc64/tcg-target.c.inc b/tcg/sparc64/tcg-target.c.inc
index bb23038529..4f9ec02b1f 100644
--- a/tcg/sparc64/tcg-target.c.inc
+++ b/tcg/sparc64/tcg-target.c.inc
@@ -1028,11 +1028,13 @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
 {
     TCGLabelQemuLdst *ldst = NULL;
     MemOp opc = get_memop(oi);
-    unsigned a_bits = get_alignment_bits(opc);
-    unsigned s_bits = opc & MO_SIZE;
+    MemOp s_bits = opc & MO_SIZE;
+    MemOp a_bits, atom_a, atom_u;
     unsigned a_mask;
 
     /* We don't support unaligned accesses. */
+    a_bits = atom_and_align_for_opc(s, &atom_a, &atom_u, opc,
+                                    MO_ATOM_IFALIGN, false);
     a_bits = MAX(a_bits, s_bits);
     a_mask = (1u << a_bits) - 1;
 
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH v4 52/57] tcg/i386: Honor 64-bit atomicity in 32-bit mode
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
                   ` (50 preceding siblings ...)
  2023-05-03  7:06 ` [PATCH v4 51/57] tcg/sparc64: " Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-05 13:27   ` Peter Maydell
  2023-05-03  7:06 ` [PATCH v4 53/57] tcg/i386: Support 128-bit load/store with have_atomic16 Richard Henderson
                   ` (5 subsequent siblings)
  57 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel; +Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

Use the FPU to perform 64-bit loads and stores.

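The same one-8-byte-access idea can be sketched in standalone C with SSE2
intrinsics (a sketch only, not the patch's code; the patch emits x87
FILD/FISTP instead, which needs no SSE and is smaller).  The shape mirrors
the generated code: a single 8-byte memory access, a spill to the stack,
then the two 32-bit halves are picked up by integer code:

    /* Build with -msse2 on a 32-bit x86 host.  The MOVQ load is a single
     * 8-byte access, atomic on current x86 when the pointer is 8-byte
     * aligned. */
    #include <stdint.h>
    #include <emmintrin.h>

    static uint64_t load64_single_access(const uint64_t *p)
    {
        __m128i v = _mm_loadl_epi64((const __m128i *)p);  /* one MOVQ load */
        uint64_t spill;

        _mm_storel_epi64((__m128i *)&spill, v);           /* one MOVQ store */
        return spill;   /* the compiler reads the two halves from the stack */
    }
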
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/i386/tcg-target.c.inc | 44 +++++++++++++++++++++++++++++++++------
 1 file changed, 38 insertions(+), 6 deletions(-)

diff --git a/tcg/i386/tcg-target.c.inc b/tcg/i386/tcg-target.c.inc
index 3e21f067d6..5c6c64c48a 100644
--- a/tcg/i386/tcg-target.c.inc
+++ b/tcg/i386/tcg-target.c.inc
@@ -468,6 +468,10 @@ static bool tcg_target_const_match(int64_t val, TCGType type, int ct)
 #define OPC_GRP5        (0xff)
 #define OPC_GRP14       (0x73 | P_EXT | P_DATA16)
 
+#define OPC_ESCDF       (0xdf)
+#define ESCDF_FILD_m64  5
+#define ESCDF_FISTP_m64 7
+
 /* Group 1 opcode extensions for 0x80-0x83.
    These are also used as modifiers for OPC_ARITH.  */
 #define ARITH_ADD 0
@@ -2091,7 +2095,20 @@ static void tcg_out_qemu_ld_direct(TCGContext *s, TCGReg datalo, TCGReg datahi,
             datalo = datahi;
             datahi = t;
         }
-        if (h.base == datalo || h.index == datalo) {
+        if (h.atom == MO_64) {
+            /*
+             * Atomicity requires that we use a single 8-byte load.
+             * For simplicity and code size, always use the FPU for this.
+             * Similar insns using SSE/AVX are merely larger.
+             * Load from memory in one go, then store back to the stack,
+             * from whence we can load into the correct integer regs.
+             */
+            tcg_out_modrm_sib_offset(s, OPC_ESCDF + h.seg, ESCDF_FILD_m64,
+                                     h.base, h.index, 0, h.ofs);
+            tcg_out_modrm_offset(s, OPC_ESCDF, ESCDF_FISTP_m64, TCG_REG_ESP, 0);
+            tcg_out_modrm_offset(s, movop, datalo, TCG_REG_ESP, 0);
+            tcg_out_modrm_offset(s, movop, datahi, TCG_REG_ESP, 4);
+        } else if (h.base == datalo || h.index == datalo) {
             tcg_out_modrm_sib_offset(s, OPC_LEA, datahi,
                                      h.base, h.index, 0, h.ofs);
             tcg_out_modrm_offset(s, movop + h.seg, datalo, datahi, 0);
@@ -2161,12 +2178,27 @@ static void tcg_out_qemu_st_direct(TCGContext *s, TCGReg datalo, TCGReg datahi,
         if (TCG_TARGET_REG_BITS == 64) {
             tcg_out_modrm_sib_offset(s, movop + P_REXW + h.seg, datalo,
                                      h.base, h.index, 0, h.ofs);
+            break;
+        }
+        if (use_movbe) {
+            TCGReg t = datalo;
+            datalo = datahi;
+            datahi = t;
+        }
+        if (h.atom == MO_64) {
+            /*
+             * Atomicity requires that we use one 8-byte store.
+             * For simplicity and code size, always use the FPU for this.
+             * Similar insns using SSE/AVX are merely larger.
+             * Assemble the 8-byte quantity in the required endianness
+             * on the stack, load it into the FPU, and store.
+             */
+            tcg_out_modrm_offset(s, movop, datalo, TCG_REG_ESP, 0);
+            tcg_out_modrm_offset(s, movop, datahi, TCG_REG_ESP, 4);
+            tcg_out_modrm_offset(s, OPC_ESCDF, ESCDF_FILD_m64, TCG_REG_ESP, 0);
+            tcg_out_modrm_sib_offset(s, OPC_ESCDF + h.seg, ESCDF_FISTP_m64,
+                                     h.base, h.index, 0, h.ofs);
         } else {
-            if (use_movbe) {
-                TCGReg t = datalo;
-                datalo = datahi;
-                datahi = t;
-            }
             tcg_out_modrm_sib_offset(s, movop + h.seg, datalo,
                                      h.base, h.index, 0, h.ofs);
             tcg_out_modrm_sib_offset(s, movop + h.seg, datahi,
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH v4 53/57] tcg/i386: Support 128-bit load/store with have_atomic16
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
                   ` (51 preceding siblings ...)
  2023-05-03  7:06 ` [PATCH v4 52/57] tcg/i386: Honor 64-bit atomicity in 32-bit mode Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-05 13:34   ` Peter Maydell
  2023-05-03  7:06 ` [PATCH v4 54/57] tcg/aarch64: Rename temporaries Richard Henderson
                   ` (4 subsequent siblings)
  57 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel; +Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/i386/tcg-target.h     |   3 +-
 tcg/i386/tcg-target.c.inc | 184 +++++++++++++++++++++++++++++++++++++-
 2 files changed, 182 insertions(+), 5 deletions(-)

diff --git a/tcg/i386/tcg-target.h b/tcg/i386/tcg-target.h
index 943af6775e..7f69997e30 100644
--- a/tcg/i386/tcg-target.h
+++ b/tcg/i386/tcg-target.h
@@ -194,7 +194,8 @@ extern bool have_atomic16;
 #define TCG_TARGET_HAS_qemu_st8_i32     1
 #endif
 
-#define TCG_TARGET_HAS_qemu_ldst_i128   0
+#define TCG_TARGET_HAS_qemu_ldst_i128 \
+    (TCG_TARGET_REG_BITS == 64 && have_atomic16)
 
 /* We do not support older SSE systems, only beginning with AVX1.  */
 #define TCG_TARGET_HAS_v64              have_avx1
diff --git a/tcg/i386/tcg-target.c.inc b/tcg/i386/tcg-target.c.inc
index 5c6c64c48a..a2739977a6 100644
--- a/tcg/i386/tcg-target.c.inc
+++ b/tcg/i386/tcg-target.c.inc
@@ -91,6 +91,8 @@ static const int tcg_target_reg_alloc_order[] = {
 #endif
 };
 
+#define TCG_TMP_VEC  TCG_REG_XMM5
+
 static const int tcg_target_call_iarg_regs[] = {
 #if TCG_TARGET_REG_BITS == 64
 #if defined(_WIN64)
@@ -347,6 +349,8 @@ static bool tcg_target_const_match(int64_t val, TCGType type, int ct)
 #define OPC_PCMPGTW     (0x65 | P_EXT | P_DATA16)
 #define OPC_PCMPGTD     (0x66 | P_EXT | P_DATA16)
 #define OPC_PCMPGTQ     (0x37 | P_EXT38 | P_DATA16)
+#define OPC_PEXTRD      (0x16 | P_EXT3A | P_DATA16)
+#define OPC_PINSRD      (0x22 | P_EXT3A | P_DATA16)
 #define OPC_PMAXSB      (0x3c | P_EXT38 | P_DATA16)
 #define OPC_PMAXSW      (0xee | P_EXT | P_DATA16)
 #define OPC_PMAXSD      (0x3d | P_EXT38 | P_DATA16)
@@ -1784,7 +1788,22 @@ typedef struct {
 
 bool tcg_target_has_memory_bswap(MemOp memop)
 {
-    return have_movbe;
+    MemOp atom_a, atom_u;
+
+    if (!have_movbe) {
+        return false;
+    }
+    if ((memop & MO_SIZE) <= MO_64) {
+        return true;
+    }
+
+    /*
+     * Reject 16-byte memop with 16-byte atomicity, i.e. VMOVDQA,
+     * but do allow a pair of 64-bit operations, i.e. MOVBEQ.
+     */
+    (void)atom_and_align_for_opc(tcg_ctx, &atom_a, &atom_u, memop,
+                                 MO_ATOM_IFALIGN, true);
+    return atom_a <= MO_64;
 }
 
 /*
@@ -1812,6 +1831,30 @@ static const TCGLdstHelperParam ldst_helper_param = {
 static const TCGLdstHelperParam ldst_helper_param = { };
 #endif
 
+static void tcg_out_vec_to_pair(TCGContext *s, TCGType type,
+                                TCGReg l, TCGReg h, TCGReg v)
+{
+    int rexw = type == TCG_TYPE_I32 ? 0 : P_REXW;
+
+    /* vpmov{d,q} %v, %l */
+    tcg_out_vex_modrm(s, OPC_MOVD_EyVy + rexw, v, 0, l);
+    /* vpextr{d,q} $1, %v, %h */
+    tcg_out_vex_modrm(s, OPC_PEXTRD + rexw, v, 0, h);
+    tcg_out8(s, 1);
+}
+
+static void tcg_out_pair_to_vec(TCGContext *s, TCGType type,
+                                TCGReg v, TCGReg l, TCGReg h)
+{
+    int rexw = type == TCG_TYPE_I32 ? 0 : P_REXW;
+
+    /* vmov{d,q} %l, %v */
+    tcg_out_vex_modrm(s, OPC_MOVD_VyEy + rexw, v, 0, l);
+    /* vpinsr{d,q} $1, %h, %v, %v */
+    tcg_out_vex_modrm(s, OPC_PINSRD + rexw, v, v, h);
+    tcg_out8(s, 1);
+}
+
 /*
  * Generate code for the slow path for a load at the end of block
  */
@@ -1901,11 +1944,12 @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
 {
     TCGLabelQemuLdst *ldst = NULL;
     MemOp opc = get_memop(oi);
-    MemOp atom_u;
+    MemOp atom_u, s_bits;
     unsigned a_mask;
 
+    s_bits = opc & MO_SIZE;
     h->align = atom_and_align_for_opc(s, &h->atom, &atom_u, opc,
-                                      MO_ATOM_IFALIGN, false);
+                                      MO_ATOM_IFALIGN, s_bits == MO_128);
     a_mask = (1 << h->align) - 1;
 
 #ifdef CONFIG_SOFTMMU
@@ -1915,7 +1959,6 @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
     TCGType tlbtype = TCG_TYPE_I32;
     int trexw = 0, hrexw = 0, tlbrexw = 0;
     unsigned mem_index = get_mmuidx(oi);
-    unsigned s_bits = opc & MO_SIZE;
     unsigned s_mask = (1 << s_bits) - 1;
     target_ulong tlb_mask;
 
@@ -2120,6 +2163,69 @@ static void tcg_out_qemu_ld_direct(TCGContext *s, TCGReg datalo, TCGReg datahi,
                                      h.base, h.index, 0, h.ofs + 4);
         }
         break;
+
+    case MO_128:
+        {
+            TCGLabel *l1 = NULL, *l2 = NULL;
+            bool use_pair = h.align < MO_128;
+
+            tcg_debug_assert(TCG_TARGET_REG_BITS == 64);
+
+            if (!use_pair) {
+                tcg_debug_assert(!use_movbe);
+                /*
+                 * Atomicity requires that we use VMOVDQA.
+                 * If we've already checked for 16-byte alignment, that's all
+                 * we need.  If we arrive here with lesser alignment, then we
+                 * have determined that less than 16-byte alignment can be
+                 * satisfied with two 8-byte loads.
+                 */
+                if (h.align < MO_128) {
+                    use_pair = true;
+                    l1 = gen_new_label();
+                    l2 = gen_new_label();
+
+                    tcg_out_testi(s, h.base, 15);
+                    tcg_out_jxx(s, JCC_JNE, l2, true);
+                }
+
+                tcg_out_vex_modrm_sib_offset(s, OPC_MOVDQA_VxWx + h.seg,
+                                             TCG_TMP_VEC, 0,
+                                             h.base, h.index, 0, h.ofs);
+                tcg_out_vec_to_pair(s, TCG_TYPE_I64, datalo,
+                                    datahi, TCG_TMP_VEC);
+
+                if (use_pair) {
+                    tcg_out_jxx(s, JCC_JMP, l1, true);
+                    tcg_out_label(s, l2);
+                }
+            }
+            if (use_pair) {
+                if (use_movbe) {
+                    TCGReg t = datalo;
+                    datalo = datahi;
+                    datahi = t;
+                }
+                if (h.base == datalo || h.index == datalo) {
+                    tcg_out_modrm_sib_offset(s, OPC_LEA, datahi,
+                                             h.base, h.index, 0, h.ofs);
+                    tcg_out_modrm_offset(s, movop + P_REXW + h.seg,
+                                         datalo, datahi, 0);
+                    tcg_out_modrm_offset(s, movop + P_REXW + h.seg,
+                                         datahi, datahi, 8);
+                } else {
+                    tcg_out_modrm_sib_offset(s, movop + P_REXW + h.seg, datalo,
+                                             h.base, h.index, 0, h.ofs);
+                    tcg_out_modrm_sib_offset(s, movop + P_REXW + h.seg, datahi,
+                                             h.base, h.index, 0, h.ofs + 8);
+                }
+            }
+            if (l1) {
+                tcg_out_label(s, l1);
+            }
+        }
+        break;
+
     default:
         g_assert_not_reached();
     }
@@ -2205,6 +2311,60 @@ static void tcg_out_qemu_st_direct(TCGContext *s, TCGReg datalo, TCGReg datahi,
                                      h.base, h.index, 0, h.ofs + 4);
         }
         break;
+
+    case MO_128:
+        {
+            TCGLabel *l1 = NULL, *l2 = NULL;
+            bool use_pair = h.align < MO_128;
+
+            tcg_debug_assert(TCG_TARGET_REG_BITS == 64);
+
+            if (!use_pair) {
+                tcg_debug_assert(!use_movbe);
+                /*
+                 * Atomicity requires that we use VMOVDQA.
+                 * If we've already checked for 16-byte alignment, that's all
+                 * we need.  If we arrive here with lesser alignment, then we
+                 * have determined that less than 16-byte alignment can be
+                 * satisfied with two 8-byte stores.
+                 */
+                if (h.align < MO_128) {
+                    use_pair = true;
+                    l1 = gen_new_label();
+                    l2 = gen_new_label();
+
+                    tcg_out_testi(s, h.base, 15);
+                    tcg_out_jxx(s, JCC_JNE, l2, true);
+                }
+
+                tcg_out_pair_to_vec(s, TCG_TYPE_I64, TCG_TMP_VEC,
+                                    datalo, datahi);
+                tcg_out_vex_modrm_sib_offset(s, OPC_MOVDQA_WxVx + h.seg,
+                                             TCG_TMP_VEC, 0,
+                                             h.base, h.index, 0, h.ofs);
+
+                if (use_pair) {
+                    tcg_out_jxx(s, JCC_JMP, l1, true);
+                    tcg_out_label(s, l2);
+                }
+            }
+            if (use_pair) {
+                if (use_movbe) {
+                    TCGReg t = datalo;
+                    datalo = datahi;
+                    datahi = t;
+                }
+                tcg_out_modrm_sib_offset(s, movop + P_REXW + h.seg, datalo,
+                                         h.base, h.index, 0, h.ofs);
+                tcg_out_modrm_sib_offset(s, movop + P_REXW + h.seg, datahi,
+                                         h.base, h.index, 0, h.ofs + 8);
+            }
+            if (l1) {
+                tcg_out_label(s, l1);
+            }
+        }
+        break;
+
     default:
         g_assert_not_reached();
     }
@@ -2528,6 +2688,10 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
             tcg_out_qemu_ld(s, a0, a1, a2, args[3], args[4], TCG_TYPE_I64);
         }
         break;
+    case INDEX_op_qemu_ld_i128:
+        tcg_debug_assert(TCG_TARGET_REG_BITS == 64);
+        tcg_out_qemu_ld(s, a0, a1, a2, -1, args[3], TCG_TYPE_I128);
+        break;
     case INDEX_op_qemu_st_i32:
     case INDEX_op_qemu_st8_i32:
         if (TCG_TARGET_REG_BITS >= TARGET_LONG_BITS) {
@@ -2545,6 +2709,10 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
             tcg_out_qemu_st(s, a0, a1, a2, args[3], args[4], TCG_TYPE_I64);
         }
         break;
+    case INDEX_op_qemu_st_i128:
+        tcg_debug_assert(TCG_TARGET_REG_BITS == 64);
+        tcg_out_qemu_st(s, a0, a1, a2, -1, args[3], TCG_TYPE_I128);
+        break;
 
     OP_32_64(mulu2):
         tcg_out_modrm(s, OPC_GRP3_Ev + rexw, EXT3_MUL, args[3]);
@@ -3239,6 +3407,13 @@ static TCGConstraintSetIndex tcg_target_op_def(TCGOpcode op)
                 : TARGET_LONG_BITS <= TCG_TARGET_REG_BITS ? C_O0_I3(L, L, L)
                 : C_O0_I4(L, L, L, L));
 
+    case INDEX_op_qemu_ld_i128:
+        tcg_debug_assert(TCG_TARGET_REG_BITS == 64);
+        return C_O2_I1(r, r, L);
+    case INDEX_op_qemu_st_i128:
+        tcg_debug_assert(TCG_TARGET_REG_BITS == 64);
+        return C_O0_I3(L, L, L);
+
     case INDEX_op_brcond2_i32:
         return C_O0_I4(r, r, ri, ri);
 
@@ -4095,6 +4270,7 @@ static void tcg_target_init(TCGContext *s)
 
     s->reserved_regs = 0;
     tcg_regset_set_reg(s->reserved_regs, TCG_REG_CALL_STACK);
+    tcg_regset_set_reg(s->reserved_regs, TCG_TMP_VEC);
 #ifdef _WIN64
     /* These are call saved, and we don't save them, so don't use them. */
     tcg_regset_set_reg(s->reserved_regs, TCG_REG_XMM6);
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH v4 54/57] tcg/aarch64: Rename temporaries
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
                   ` (52 preceding siblings ...)
  2023-05-03  7:06 ` [PATCH v4 53/57] tcg/i386: Support 128-bit load/store with have_atomic16 Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-05 13:36   ` Peter Maydell
  2023-05-03  7:06 ` [PATCH v4 55/57] tcg/aarch64: Support 128-bit load/store Richard Henderson
                   ` (3 subsequent siblings)
  57 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel; +Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

We will need to allocate a second general-purpose temporary.
Rename the existing temps to add a distinguishing number.

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/aarch64/tcg-target.c.inc | 50 ++++++++++++++++++------------------
 1 file changed, 25 insertions(+), 25 deletions(-)

diff --git a/tcg/aarch64/tcg-target.c.inc b/tcg/aarch64/tcg-target.c.inc
index 1d6d382edd..76a6bfd202 100644
--- a/tcg/aarch64/tcg-target.c.inc
+++ b/tcg/aarch64/tcg-target.c.inc
@@ -80,8 +80,8 @@ static TCGReg tcg_target_call_oarg_reg(TCGCallReturnKind kind, int slot)
 bool have_lse;
 bool have_lse2;
 
-#define TCG_REG_TMP TCG_REG_X30
-#define TCG_VEC_TMP TCG_REG_V31
+#define TCG_REG_TMP0 TCG_REG_X30
+#define TCG_VEC_TMP0 TCG_REG_V31
 
 #ifndef CONFIG_SOFTMMU
 /* Note that XZR cannot be encoded in the address base register slot,
@@ -998,7 +998,7 @@ static bool tcg_out_dup_vec(TCGContext *s, TCGType type, unsigned vece,
 static bool tcg_out_dupm_vec(TCGContext *s, TCGType type, unsigned vece,
                              TCGReg r, TCGReg base, intptr_t offset)
 {
-    TCGReg temp = TCG_REG_TMP;
+    TCGReg temp = TCG_REG_TMP0;
 
     if (offset < -0xffffff || offset > 0xffffff) {
         tcg_out_movi(s, TCG_TYPE_PTR, temp, offset);
@@ -1150,8 +1150,8 @@ static void tcg_out_ldst(TCGContext *s, AArch64Insn insn, TCGReg rd,
     }
 
     /* Worst-case scenario, move offset to temp register, use reg offset.  */
-    tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_TMP, offset);
-    tcg_out_ldst_r(s, insn, rd, rn, TCG_TYPE_I64, TCG_REG_TMP);
+    tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_TMP0, offset);
+    tcg_out_ldst_r(s, insn, rd, rn, TCG_TYPE_I64, TCG_REG_TMP0);
 }
 
 static bool tcg_out_mov(TCGContext *s, TCGType type, TCGReg ret, TCGReg arg)
@@ -1367,8 +1367,8 @@ static void tcg_out_call_int(TCGContext *s, const tcg_insn_unit *target)
     if (offset == sextract64(offset, 0, 26)) {
         tcg_out_insn(s, 3206, BL, offset);
     } else {
-        tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_TMP, (intptr_t)target);
-        tcg_out_insn(s, 3207, BLR, TCG_REG_TMP);
+        tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_TMP0, (intptr_t)target);
+        tcg_out_insn(s, 3207, BLR, TCG_REG_TMP0);
     }
 }
 
@@ -1505,7 +1505,7 @@ static void tcg_out_addsub2(TCGContext *s, TCGType ext, TCGReg rl,
     AArch64Insn insn;
 
     if (rl == ah || (!const_bh && rl == bh)) {
-        rl = TCG_REG_TMP;
+        rl = TCG_REG_TMP0;
     }
 
     if (const_bl) {
@@ -1522,7 +1522,7 @@ static void tcg_out_addsub2(TCGContext *s, TCGType ext, TCGReg rl,
                possibility of adding 0+const in the low part, and the
                immediate add instructions encode XSP not XZR.  Don't try
                anything more elaborate here than loading another zero.  */
-            al = TCG_REG_TMP;
+            al = TCG_REG_TMP0;
             tcg_out_movi(s, ext, al, 0);
         }
         tcg_out_insn_3401(s, insn, ext, rl, al, bl);
@@ -1563,7 +1563,7 @@ static void tcg_out_cltz(TCGContext *s, TCGType ext, TCGReg d,
 {
     TCGReg a1 = a0;
     if (is_ctz) {
-        a1 = TCG_REG_TMP;
+        a1 = TCG_REG_TMP0;
         tcg_out_insn(s, 3507, RBIT, ext, a1, a0);
     }
     if (const_b && b == (ext ? 64 : 32)) {
@@ -1572,7 +1572,7 @@ static void tcg_out_cltz(TCGContext *s, TCGType ext, TCGReg d,
         AArch64Insn sel = I3506_CSEL;
 
         tcg_out_cmp(s, ext, a0, 0, 1);
-        tcg_out_insn(s, 3507, CLZ, ext, TCG_REG_TMP, a1);
+        tcg_out_insn(s, 3507, CLZ, ext, TCG_REG_TMP0, a1);
 
         if (const_b) {
             if (b == -1) {
@@ -1585,7 +1585,7 @@ static void tcg_out_cltz(TCGContext *s, TCGType ext, TCGReg d,
                 b = d;
             }
         }
-        tcg_out_insn_3506(s, sel, ext, d, TCG_REG_TMP, b, TCG_COND_NE);
+        tcg_out_insn_3506(s, sel, ext, d, TCG_REG_TMP0, b, TCG_COND_NE);
     }
 }
 
@@ -1603,7 +1603,7 @@ bool tcg_target_has_memory_bswap(MemOp memop)
 }
 
 static const TCGLdstHelperParam ldst_helper_param = {
-    .ntmp = 1, .tmp = { TCG_REG_TMP }
+    .ntmp = 1, .tmp = { TCG_REG_TMP0 }
 };
 
 static bool tcg_out_qemu_ld_slow_path(TCGContext *s, TCGLabelQemuLdst *lb)
@@ -1864,7 +1864,7 @@ static void tcg_out_goto_tb(TCGContext *s, int which)
 
     set_jmp_insn_offset(s, which);
     tcg_out32(s, I3206_B);
-    tcg_out_insn(s, 3207, BR, TCG_REG_TMP);
+    tcg_out_insn(s, 3207, BR, TCG_REG_TMP0);
     set_jmp_reset_offset(s, which);
 }
 
@@ -1883,7 +1883,7 @@ void tb_target_set_jmp_target(const TranslationBlock *tb, int n,
         ptrdiff_t i_offset = i_addr - jmp_rx;
 
         /* Note that we asserted this in range in tcg_out_goto_tb. */
-        insn = deposit32(I3305_LDR | TCG_REG_TMP, 5, 19, i_offset >> 2);
+        insn = deposit32(I3305_LDR | TCG_REG_TMP0, 5, 19, i_offset >> 2);
     }
     qatomic_set((uint32_t *)jmp_rw, insn);
     flush_idcache_range(jmp_rx, jmp_rw, 4);
@@ -2079,13 +2079,13 @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc,
 
     case INDEX_op_rem_i64:
     case INDEX_op_rem_i32:
-        tcg_out_insn(s, 3508, SDIV, ext, TCG_REG_TMP, a1, a2);
-        tcg_out_insn(s, 3509, MSUB, ext, a0, TCG_REG_TMP, a2, a1);
+        tcg_out_insn(s, 3508, SDIV, ext, TCG_REG_TMP0, a1, a2);
+        tcg_out_insn(s, 3509, MSUB, ext, a0, TCG_REG_TMP0, a2, a1);
         break;
     case INDEX_op_remu_i64:
     case INDEX_op_remu_i32:
-        tcg_out_insn(s, 3508, UDIV, ext, TCG_REG_TMP, a1, a2);
-        tcg_out_insn(s, 3509, MSUB, ext, a0, TCG_REG_TMP, a2, a1);
+        tcg_out_insn(s, 3508, UDIV, ext, TCG_REG_TMP0, a1, a2);
+        tcg_out_insn(s, 3509, MSUB, ext, a0, TCG_REG_TMP0, a2, a1);
         break;
 
     case INDEX_op_shl_i64:
@@ -2129,8 +2129,8 @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc,
         if (c2) {
             tcg_out_rotl(s, ext, a0, a1, a2);
         } else {
-            tcg_out_insn(s, 3502, SUB, 0, TCG_REG_TMP, TCG_REG_XZR, a2);
-            tcg_out_insn(s, 3508, RORV, ext, a0, a1, TCG_REG_TMP);
+            tcg_out_insn(s, 3502, SUB, 0, TCG_REG_TMP0, TCG_REG_XZR, a2);
+            tcg_out_insn(s, 3508, RORV, ext, a0, a1, TCG_REG_TMP0);
         }
         break;
 
@@ -2532,8 +2532,8 @@ static void tcg_out_vec_op(TCGContext *s, TCGOpcode opc,
                             break;
                         }
                     }
-                    tcg_out_dupi_vec(s, type, MO_8, TCG_VEC_TMP, 0);
-                    a2 = TCG_VEC_TMP;
+                    tcg_out_dupi_vec(s, type, MO_8, TCG_VEC_TMP0, 0);
+                    a2 = TCG_VEC_TMP0;
                 }
                 if (is_scalar) {
                     insn = cmp_scalar_insn[cond];
@@ -2942,9 +2942,9 @@ static void tcg_target_init(TCGContext *s)
     s->reserved_regs = 0;
     tcg_regset_set_reg(s->reserved_regs, TCG_REG_SP);
     tcg_regset_set_reg(s->reserved_regs, TCG_REG_FP);
-    tcg_regset_set_reg(s->reserved_regs, TCG_REG_TMP);
     tcg_regset_set_reg(s->reserved_regs, TCG_REG_X18); /* platform register */
-    tcg_regset_set_reg(s->reserved_regs, TCG_VEC_TMP);
+    tcg_regset_set_reg(s->reserved_regs, TCG_REG_TMP0);
+    tcg_regset_set_reg(s->reserved_regs, TCG_VEC_TMP0);
 }
 
 /* Saving pairs: (X19, X20) .. (X27, X28), (X29(fp), X30(lr)).  */
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH v4 55/57] tcg/aarch64: Support 128-bit load/store
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
                   ` (53 preceding siblings ...)
  2023-05-03  7:06 ` [PATCH v4 54/57] tcg/aarch64: Rename temporaries Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-05 13:41   ` Peter Maydell
  2023-05-03  7:06 ` [PATCH v4 56/57] tcg/ppc: " Richard Henderson
                   ` (2 subsequent siblings)
  57 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel; +Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

Use LDXP+STXP when LSE2 is not present and 16-byte atomicity is required,
and LDP/STP otherwise.  This requires allocating a second general-purpose
temporary, as Rs cannot overlap Rn in STXP.

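For reference, the fallback has the same shape as the classic
exclusive-pair loop; a standalone sketch in GCC inline asm (not the
patch's code; note the location must be writable, since STXP writes the
loaded value back):

    #include <stdint.h>

    /* 16-byte single-copy-atomic load without FEAT_LSE2: retry
     * LDXP/STXP until the store-exclusive succeeds.  The early-clobber
     * outputs keep the STXP status register distinct from the data and
     * address registers, which is what the new TCG_REG_TMP1 provides in
     * the backend. */
    static void load128_ldxp_stxp(void *p, uint64_t out[2])
    {
        uint64_t lo, hi;
        uint32_t fail;

        __asm__ volatile("0: ldxp  %0, %1, [%3]\n\t"
                         "   stxp  %w2, %0, %1, [%3]\n\t"
                         "   cbnz  %w2, 0b"
                         : "=&r"(lo), "=&r"(hi), "=&r"(fail)
                         : "r"(p)
                         : "memory");
        out[0] = lo;   /* doubleword at the lower address */
        out[1] = hi;   /* doubleword at the higher address */
    }
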
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/aarch64/tcg-target-con-set.h |   2 +
 tcg/aarch64/tcg-target.h         |   2 +-
 tcg/aarch64/tcg-target.c.inc     | 181 ++++++++++++++++++++++++++++++-
 3 files changed, 181 insertions(+), 4 deletions(-)

diff --git a/tcg/aarch64/tcg-target-con-set.h b/tcg/aarch64/tcg-target-con-set.h
index d6c6866878..74065c7098 100644
--- a/tcg/aarch64/tcg-target-con-set.h
+++ b/tcg/aarch64/tcg-target-con-set.h
@@ -14,6 +14,7 @@ C_O0_I2(lZ, l)
 C_O0_I2(r, rA)
 C_O0_I2(rZ, r)
 C_O0_I2(w, r)
+C_O0_I3(lZ, lZ, l)
 C_O1_I1(r, l)
 C_O1_I1(r, r)
 C_O1_I1(w, r)
@@ -33,4 +34,5 @@ C_O1_I2(w, w, wO)
 C_O1_I2(w, w, wZ)
 C_O1_I3(w, w, w, w)
 C_O1_I4(r, r, rA, rZ, rZ)
+C_O2_I1(r, r, l)
 C_O2_I4(r, r, rZ, rZ, rA, rMZ)
diff --git a/tcg/aarch64/tcg-target.h b/tcg/aarch64/tcg-target.h
index 74ee2ed255..fa6af9746f 100644
--- a/tcg/aarch64/tcg-target.h
+++ b/tcg/aarch64/tcg-target.h
@@ -129,7 +129,7 @@ extern bool have_lse2;
 #define TCG_TARGET_HAS_muluh_i64        1
 #define TCG_TARGET_HAS_mulsh_i64        1
 
-#define TCG_TARGET_HAS_qemu_ldst_i128   0
+#define TCG_TARGET_HAS_qemu_ldst_i128   1
 
 #define TCG_TARGET_HAS_v64              1
 #define TCG_TARGET_HAS_v128             1
diff --git a/tcg/aarch64/tcg-target.c.inc b/tcg/aarch64/tcg-target.c.inc
index 76a6bfd202..f1627cb96d 100644
--- a/tcg/aarch64/tcg-target.c.inc
+++ b/tcg/aarch64/tcg-target.c.inc
@@ -81,6 +81,7 @@ bool have_lse;
 bool have_lse2;
 
 #define TCG_REG_TMP0 TCG_REG_X30
+#define TCG_REG_TMP1 TCG_REG_X17
 #define TCG_VEC_TMP0 TCG_REG_V31
 
 #ifndef CONFIG_SOFTMMU
@@ -404,6 +405,10 @@ typedef enum {
     I3305_LDR_v64   = 0x5c000000,
     I3305_LDR_v128  = 0x9c000000,
 
+    /* Load/store exclusive. */
+    I3306_LDXP      = 0xc8600000,
+    I3306_STXP      = 0xc8200000,
+
     /* Load/store register.  Described here as 3.3.12, but the helper
        that emits them can transform to 3.3.10 or 3.3.13.  */
     I3312_STRB      = 0x38000000 | LDST_ST << 22 | MO_8 << 30,
@@ -468,6 +473,9 @@ typedef enum {
     I3406_ADR       = 0x10000000,
     I3406_ADRP      = 0x90000000,
 
+    /* Add/subtract extended register instructions. */
+    I3501_ADD       = 0x0b200000,
+
     /* Add/subtract shifted register instructions (without a shift).  */
     I3502_ADD       = 0x0b000000,
     I3502_ADDS      = 0x2b000000,
@@ -638,6 +646,12 @@ static void tcg_out_insn_3305(TCGContext *s, AArch64Insn insn,
     tcg_out32(s, insn | (imm19 & 0x7ffff) << 5 | rt);
 }
 
+static void tcg_out_insn_3306(TCGContext *s, AArch64Insn insn, TCGReg rs,
+                              TCGReg rt, TCGReg rt2, TCGReg rn)
+{
+    tcg_out32(s, insn | rs << 16 | rt2 << 10 | rn << 5 | rt);
+}
+
 static void tcg_out_insn_3201(TCGContext *s, AArch64Insn insn, TCGType ext,
                               TCGReg rt, int imm19)
 {
@@ -720,6 +734,14 @@ static void tcg_out_insn_3406(TCGContext *s, AArch64Insn insn,
     tcg_out32(s, insn | (disp & 3) << 29 | (disp & 0x1ffffc) << (5 - 2) | rd);
 }
 
+static inline void tcg_out_insn_3501(TCGContext *s, AArch64Insn insn,
+                                     TCGType sf, TCGReg rd, TCGReg rn,
+                                     TCGReg rm, int opt, int imm3)
+{
+    tcg_out32(s, insn | sf << 31 | rm << 16 | opt << 13 |
+              imm3 << 10 | rn << 5 | rd);
+}
+
 /* This function is for both 3.5.2 (Add/Subtract shifted register), for
    the rare occasion when we actually want to supply a shift amount.  */
 static inline void tcg_out_insn_3502S(TCGContext *s, AArch64Insn insn,
@@ -1648,17 +1670,17 @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
     TCGType addr_type = TARGET_LONG_BITS == 64 ? TCG_TYPE_I64 : TCG_TYPE_I32;
     TCGLabelQemuLdst *ldst = NULL;
     MemOp opc = get_memop(oi);
-    MemOp atom_u;
+    MemOp atom_u, s_bits;
     unsigned a_mask;
 
+    s_bits = opc & MO_SIZE;
     h->align = atom_and_align_for_opc(s, &h->atom, &atom_u, opc,
                                       have_lse2 ? MO_ATOM_WITHIN16
                                                 : MO_ATOM_IFALIGN,
-                                      false);
+                                      s_bits == MO_128);
     a_mask = (1 << h->align) - 1;
 
 #ifdef CONFIG_SOFTMMU
-    unsigned s_bits = opc & MO_SIZE;
     unsigned s_mask = (1u << s_bits) - 1;
     unsigned mem_index = get_mmuidx(oi);
     TCGReg x3;
@@ -1839,6 +1861,148 @@ static void tcg_out_qemu_st(TCGContext *s, TCGReg data_reg, TCGReg addr_reg,
     }
 }
 
+static TCGLabelQemuLdst *
+prepare_host_addr_base_only(TCGContext *s, HostAddress *h, TCGReg addr_reg,
+                            MemOpIdx oi, bool is_ld)
+{
+    TCGLabelQemuLdst *ldst;
+
+    ldst = prepare_host_addr(s, h, addr_reg, oi, is_ld);
+
+    /* Compose the final address, as LDP/STP have no indexing. */
+    if (h->index != TCG_REG_XZR) {
+        tcg_out_insn(s, 3501, ADD, TCG_TYPE_I64, TCG_REG_TMP0,
+                     h->base, h->index,
+                     h->index_ext == TCG_TYPE_I32 ? MO_32 : MO_64, 0);
+        h->base = TCG_REG_TMP0;
+        h->index = TCG_REG_XZR;
+        h->index_ext = TCG_TYPE_I64;
+    }
+
+    return ldst;
+}
+
+static void tcg_out_qemu_ld128(TCGContext *s, TCGReg datalo, TCGReg datahi,
+                               TCGReg addr_reg, MemOpIdx oi)
+{
+    TCGLabelQemuLdst *ldst;
+    HostAddress h;
+
+    ldst = prepare_host_addr_base_only(s, &h, addr_reg, oi, true);
+
+    if (h.atom < MO_128 || have_lse2) {
+        tcg_out_insn(s, 3314, LDP, datalo, datahi, h.base, 0, 0, 0);
+    } else {
+        TCGLabel *l0, *l1 = NULL;
+
+        /*
+         * 16-byte atomicity without LSE2 requires an LDXP+STXP loop:
+         * 1: ldxp lo,hi,[addr]
+         *    stxp tmp1,lo,hi,[addr]
+         *    cbnz tmp1, 1b
+         *
+         * If we have already checked for 16-byte alignment, that's all
+         * we need. Otherwise we have determined that misaligned atomicity
+         * may be handled with two 8-byte loads.
+         */
+        if (h.align < MO_128) {
+            /*
+             * TODO: align should be MO_64, so we only need test bit 3,
+             * which means we could use TBNZ instead of AND+CBNZ.
+             */
+            l1 = gen_new_label();
+            tcg_out_logicali(s, I3404_ANDI, 0, TCG_REG_TMP1, addr_reg, 15);
+            tcg_out_brcond(s, TCG_TYPE_I32, TCG_COND_NE,
+                           TCG_REG_TMP1, 0, 1, l1);
+        }
+
+        l0 = gen_new_label();
+        tcg_out_label(s, l0);
+
+        tcg_out_insn(s, 3306, LDXP, TCG_REG_XZR, datalo, datahi, h.base);
+        tcg_out_insn(s, 3306, STXP, TCG_REG_TMP1, datalo, datahi, h.base);
+        tcg_out_brcond(s, TCG_TYPE_I32, TCG_COND_NE, TCG_REG_TMP1, 0, 1, l0);
+
+        if (l1) {
+            TCGLabel *l2 = gen_new_label();
+            tcg_out_goto_label(s, l2);
+
+            tcg_out_label(s, l1);
+            tcg_out_insn(s, 3314, LDP, datalo, datahi, h.base, 0, 0, 0);
+
+            tcg_out_label(s, l2);
+        }
+    }
+
+    if (ldst) {
+        ldst->type = TCG_TYPE_I128;
+        ldst->datalo_reg = datalo;
+        ldst->datahi_reg = datahi;
+        ldst->raddr = tcg_splitwx_to_rx(s->code_ptr);
+    }
+}
+
+static void tcg_out_qemu_st128(TCGContext *s, TCGReg datalo, TCGReg datahi,
+                               TCGReg addr_reg, MemOpIdx oi)
+{
+    TCGLabelQemuLdst *ldst;
+    HostAddress h;
+
+    ldst = prepare_host_addr_base_only(s, &h, addr_reg, oi, false);
+
+    if (h.atom < MO_128 || have_lse2) {
+        tcg_out_insn(s, 3314, STP, datalo, datahi, h.base, 0, 0, 0);
+    } else {
+        TCGLabel *l0, *l1 = NULL;
+
+        /*
+         * 16-byte atomicity without LSE2 requires an LDXP+STXP loop:
+         * 1: ldxp xzr,tmp1,[addr]
+         *    stxp tmp1,lo,hi,[addr]
+         *    cbnz tmp1, 1b
+         *
+         * If we have already checked for 16-byte alignment, that's all
+         * we need. Otherwise we have determined that misaligned atomicity
+         * may be handled with two 8-byte stores.
+         */
+        if (h.align < MO_128) {
+            /*
+             * TODO: align should be MO_64, so we only need test bit 3,
+             * which means we could use TBNZ instead of AND+CBNZ.
+             */
+            l1 = gen_new_label();
+            tcg_out_logicali(s, I3404_ANDI, 0, TCG_REG_TMP1, addr_reg, 15);
+            tcg_out_brcond(s, TCG_TYPE_I32, TCG_COND_NE,
+                           TCG_REG_TMP1, 0, 1, l1);
+        }
+
+        l0 = gen_new_label();
+        tcg_out_label(s, l0);
+
+        tcg_out_insn(s, 3306, LDXP, TCG_REG_XZR,
+                     TCG_REG_XZR, TCG_REG_TMP1, h.base);
+        tcg_out_insn(s, 3306, STXP, TCG_REG_TMP1, datalo, datahi, h.base);
+        tcg_out_brcond(s, TCG_TYPE_I32, TCG_COND_NE, TCG_REG_TMP1, 0, 1, l0);
+
+        if (l1) {
+            TCGLabel *l2 = gen_new_label();
+            tcg_out_goto_label(s, l2);
+
+            tcg_out_label(s, l1);
+            tcg_out_insn(s, 3314, STP, datalo, datahi, h.base, 0, 0, 0);
+
+            tcg_out_label(s, l2);
+        }
+    }
+
+    if (ldst) {
+        ldst->type = TCG_TYPE_I128;
+        ldst->datalo_reg = datalo;
+        ldst->datahi_reg = datahi;
+        ldst->raddr = tcg_splitwx_to_rx(s->code_ptr);
+    }
+}
+
 static const tcg_insn_unit *tb_ret_addr;
 
 static void tcg_out_exit_tb(TCGContext *s, uintptr_t a0)
@@ -2176,6 +2340,12 @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc,
     case INDEX_op_qemu_st_i64:
         tcg_out_qemu_st(s, REG0(0), a1, a2, ext);
         break;
+    case INDEX_op_qemu_ld_i128:
+        tcg_out_qemu_ld128(s, a0, a1, a2, args[3]);
+        break;
+    case INDEX_op_qemu_st_i128:
+        tcg_out_qemu_st128(s, REG0(0), REG0(1), a2, args[3]);
+        break;
 
     case INDEX_op_bswap64_i64:
         tcg_out_rev(s, TCG_TYPE_I64, MO_64, a0, a1);
@@ -2813,9 +2983,13 @@ static TCGConstraintSetIndex tcg_target_op_def(TCGOpcode op)
     case INDEX_op_qemu_ld_i32:
     case INDEX_op_qemu_ld_i64:
         return C_O1_I1(r, l);
+    case INDEX_op_qemu_ld_i128:
+        return C_O2_I1(r, r, l);
     case INDEX_op_qemu_st_i32:
     case INDEX_op_qemu_st_i64:
         return C_O0_I2(lZ, l);
+    case INDEX_op_qemu_st_i128:
+        return C_O0_I3(lZ, lZ, l);
 
     case INDEX_op_deposit_i32:
     case INDEX_op_deposit_i64:
@@ -2944,6 +3118,7 @@ static void tcg_target_init(TCGContext *s)
     tcg_regset_set_reg(s->reserved_regs, TCG_REG_FP);
     tcg_regset_set_reg(s->reserved_regs, TCG_REG_X18); /* platform register */
     tcg_regset_set_reg(s->reserved_regs, TCG_REG_TMP0);
+    tcg_regset_set_reg(s->reserved_regs, TCG_REG_TMP1);
     tcg_regset_set_reg(s->reserved_regs, TCG_VEC_TMP0);
 }
 
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH v4 56/57] tcg/ppc: Support 128-bit load/store
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
                   ` (54 preceding siblings ...)
  2023-05-03  7:06 ` [PATCH v4 55/57] tcg/aarch64: Support 128-bit load/store Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-08 12:16   ` Daniel Henrique Barboza
  2023-05-03  7:06 ` [PATCH v4 57/57] tcg/s390x: " Richard Henderson
  2023-05-05 13:43 ` [PATCH v4 00/57] tcg: Improve atomicity support Peter Maydell
  57 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel; +Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

Use LQ/STQ when ISA v2.07 is available and 16-byte atomicity is required.
Note that these instructions do not require 16-byte alignment.

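As a standalone illustration of the instruction itself (a sketch only,
not the patch's code; assumes a ppc64 host built with -mcpu=power8 or
later): LQ loads a quadword into an even/odd GPR pair in one access, and
which doubleword lands in which half of the pair is endian-dependent:

    #include <stdint.h>

    /* Pin an even/odd pair (r8/r9) and issue a single LQ.  The
     * early-clobber outputs keep the address register out of the target
     * pair, mirroring the backend's new register-pair constraint. */
    static void load128_lq(const void *p, uint64_t out[2])
    {
        register uint64_t r8 __asm__("r8");
        register uint64_t r9 __asm__("r9");

        __asm__ volatile("lq 8, 0(%2)"
                         : "=&r"(r8), "=&r"(r9)
                         : "b"(p)
                         : "memory");
        out[0] = r8;
        out[1] = r9;
    }
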
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/ppc/tcg-target-con-set.h |   2 +
 tcg/ppc/tcg-target-con-str.h |   1 +
 tcg/ppc/tcg-target.h         |   3 +-
 tcg/ppc/tcg-target.c.inc     | 173 +++++++++++++++++++++++++++++++----
 4 files changed, 158 insertions(+), 21 deletions(-)

diff --git a/tcg/ppc/tcg-target-con-set.h b/tcg/ppc/tcg-target-con-set.h
index f206b29205..bbd7b21247 100644
--- a/tcg/ppc/tcg-target-con-set.h
+++ b/tcg/ppc/tcg-target-con-set.h
@@ -14,6 +14,7 @@ C_O0_I2(r, r)
 C_O0_I2(r, ri)
 C_O0_I2(v, r)
 C_O0_I3(r, r, r)
+C_O0_I3(o, m, r)
 C_O0_I4(r, r, ri, ri)
 C_O0_I4(r, r, r, r)
 C_O1_I1(r, r)
@@ -34,6 +35,7 @@ C_O1_I3(v, v, v, v)
 C_O1_I4(r, r, ri, rZ, rZ)
 C_O1_I4(r, r, r, ri, ri)
 C_O2_I1(r, r, r)
+C_O2_I1(o, m, r)
 C_O2_I2(r, r, r, r)
 C_O2_I4(r, r, rI, rZM, r, r)
 C_O2_I4(r, r, r, r, rI, rZM)
diff --git a/tcg/ppc/tcg-target-con-str.h b/tcg/ppc/tcg-target-con-str.h
index 094613cbcb..20846901de 100644
--- a/tcg/ppc/tcg-target-con-str.h
+++ b/tcg/ppc/tcg-target-con-str.h
@@ -9,6 +9,7 @@
  * REGS(letter, register_mask)
  */
 REGS('r', ALL_GENERAL_REGS)
+REGS('o', ALL_GENERAL_REGS & 0xAAAAAAAAu)  /* odd registers */
 REGS('v', ALL_VECTOR_REGS)
 
 /*
diff --git a/tcg/ppc/tcg-target.h b/tcg/ppc/tcg-target.h
index 0914380bd7..204b70f86a 100644
--- a/tcg/ppc/tcg-target.h
+++ b/tcg/ppc/tcg-target.h
@@ -149,7 +149,8 @@ extern bool have_vsx;
 #define TCG_TARGET_HAS_mulsh_i64        1
 #endif
 
-#define TCG_TARGET_HAS_qemu_ldst_i128   0
+#define TCG_TARGET_HAS_qemu_ldst_i128   \
+    (TCG_TARGET_REG_BITS == 64 && have_isa_2_07)
 
 /*
  * While technically Altivec could support V64, it has no 64-bit store
diff --git a/tcg/ppc/tcg-target.c.inc b/tcg/ppc/tcg-target.c.inc
index 60375804cd..682743a466 100644
--- a/tcg/ppc/tcg-target.c.inc
+++ b/tcg/ppc/tcg-target.c.inc
@@ -295,25 +295,27 @@ static bool tcg_target_const_match(int64_t val, TCGType type, int ct)
 
 #define B      OPCD( 18)
 #define BC     OPCD( 16)
+
 #define LBZ    OPCD( 34)
 #define LHZ    OPCD( 40)
 #define LHA    OPCD( 42)
 #define LWZ    OPCD( 32)
 #define LWZUX  XO31( 55)
-#define STB    OPCD( 38)
-#define STH    OPCD( 44)
-#define STW    OPCD( 36)
-
-#define STD    XO62(  0)
-#define STDU   XO62(  1)
-#define STDX   XO31(149)
-
 #define LD     XO58(  0)
 #define LDX    XO31( 21)
 #define LDU    XO58(  1)
 #define LDUX   XO31( 53)
 #define LWA    XO58(  2)
 #define LWAX   XO31(341)
+#define LQ     OPCD( 56)
+
+#define STB    OPCD( 38)
+#define STH    OPCD( 44)
+#define STW    OPCD( 36)
+#define STD    XO62(  0)
+#define STDU   XO62(  1)
+#define STDX   XO31(149)
+#define STQ    XO62(  2)
 
 #define ADDIC  OPCD( 12)
 #define ADDI   OPCD( 14)
@@ -2015,11 +2017,25 @@ static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *lb)
 typedef struct {
     TCGReg base;
     TCGReg index;
+    MemOp align;
+    MemOp atom;
 } HostAddress;
 
 bool tcg_target_has_memory_bswap(MemOp memop)
 {
-    return true;
+    MemOp atom_a, atom_u;
+
+    if ((memop & MO_SIZE) <= MO_64) {
+        return true;
+    }
+
+    /*
+     * Reject 16-byte memop with 16-byte atomicity,
+     * but do allow a pair of 64-bit operations.
+     */
+    (void)atom_and_align_for_opc(tcg_ctx, &atom_a, &atom_u, memop,
+                                 MO_ATOM_IFALIGN, true);
+    return atom_a <= MO_64;
 }
 
 /*
@@ -2034,7 +2050,7 @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
 {
     TCGLabelQemuLdst *ldst = NULL;
     MemOp opc = get_memop(oi);
-    MemOp a_bits, atom_a, atom_u;
+    MemOp a_bits, atom_u, s_bits;
 
     /*
      * Book II, Section 1.4, Single-Copy Atomicity, specifies:
@@ -2046,10 +2062,19 @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
      * As of 3.0, "the non-atomic access is performed as described in
      * the corresponding list", which matches MO_ATOM_SUBALIGN.
      */
-    a_bits = atom_and_align_for_opc(s, &atom_a, &atom_u, opc,
+    s_bits = opc & MO_SIZE;
+    a_bits = atom_and_align_for_opc(s, &h->atom, &atom_u, opc,
                                     have_isa_3_00 ? MO_ATOM_SUBALIGN
                                                   : MO_ATOM_IFALIGN,
-                                    false);
+                                    s_bits == MO_128);
+
+    if (TCG_TARGET_REG_BITS == 32) {
+        /* We don't support unaligned accesses on 32-bits. */
+        if (a_bits < s_bits) {
+            a_bits = s_bits;
+        }
+    }
+    h->align = a_bits;
 
 #ifdef CONFIG_SOFTMMU
     int mem_index = get_mmuidx(oi);
@@ -2058,7 +2083,6 @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
     int fast_off = TLB_MASK_TABLE_OFS(mem_index);
     int mask_off = fast_off + offsetof(CPUTLBDescFast, mask);
     int table_off = fast_off + offsetof(CPUTLBDescFast, table);
-    unsigned s_bits = opc & MO_SIZE;
 
     ldst = new_ldst_label(s);
     ldst->is_ld = is_ld;
@@ -2108,13 +2132,6 @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
 
     /* Clear the non-page, non-alignment bits from the address in R0. */
     if (TCG_TARGET_REG_BITS == 32) {
-        /* We don't support unaligned accesses on 32-bits.
-         * Preserve the bottom bits and thus trigger a comparison
-         * failure on unaligned accesses.
-         */
-        if (a_bits < s_bits) {
-            a_bits = s_bits;
-        }
         tcg_out_rlw(s, RLWINM, TCG_REG_R0, addrlo, 0,
                     (32 - a_bits) & 31, 31 - TARGET_PAGE_BITS);
     } else {
@@ -2299,6 +2316,108 @@ static void tcg_out_qemu_st(TCGContext *s, TCGReg datalo, TCGReg datahi,
     }
 }
 
+static TCGLabelQemuLdst *
+prepare_host_addr_index_only(TCGContext *s, HostAddress *h, TCGReg addr_reg,
+                             MemOpIdx oi, bool is_ld)
+{
+    TCGLabelQemuLdst *ldst;
+
+    ldst = prepare_host_addr(s, h, addr_reg, -1, oi, is_ld);
+
+    /* Compose the final address, as LQ/STQ have no indexing. */
+    if (h->base != 0) {
+        tcg_out32(s, ADD | TAB(TCG_REG_TMP1, h->base, h->index));
+        h->index = TCG_REG_TMP1;
+        h->base = 0;
+    }
+
+    return ldst;
+}
+
+static void tcg_out_qemu_ld128(TCGContext *s, TCGReg datalo, TCGReg datahi,
+                               TCGReg addr_reg, MemOpIdx oi)
+{
+    TCGLabelQemuLdst *ldst;
+    HostAddress h;
+    bool need_bswap;
+
+    ldst = prepare_host_addr_index_only(s, &h, addr_reg, oi, true);
+    need_bswap = get_memop(oi) & MO_BSWAP;
+
+    if (h.atom == MO_128) {
+        tcg_debug_assert(!need_bswap);
+        tcg_debug_assert(datalo & 1);
+        tcg_debug_assert(datahi == datalo - 1);
+        tcg_out32(s, LQ | TAI(datahi, h.index, 0));
+    } else {
+        TCGReg d1, d2;
+
+        if (HOST_BIG_ENDIAN ^ need_bswap) {
+            d1 = datahi, d2 = datalo;
+        } else {
+            d1 = datalo, d2 = datahi;
+        }
+
+        if (need_bswap) {
+            tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_R0, 8);
+            tcg_out32(s, LDBRX | TAB(d1, 0, h.index));
+            tcg_out32(s, LDBRX | TAB(d2, h.index, TCG_REG_R0));
+        } else {
+            tcg_out32(s, LD | TAI(d1, h.index, 0));
+            tcg_out32(s, LD | TAI(d2, h.index, 8));
+        }
+    }
+
+    if (ldst) {
+        ldst->type = TCG_TYPE_I128;
+        ldst->datalo_reg = datalo;
+        ldst->datahi_reg = datahi;
+        ldst->raddr = tcg_splitwx_to_rx(s->code_ptr);
+    }
+}
+
+static void tcg_out_qemu_st128(TCGContext *s, TCGReg datalo, TCGReg datahi,
+                               TCGReg addr_reg, MemOpIdx oi)
+{
+    TCGLabelQemuLdst *ldst;
+    HostAddress h;
+    bool need_bswap;
+
+    ldst = prepare_host_addr_index_only(s, &h, addr_reg, oi, false);
+    need_bswap = get_memop(oi) & MO_BSWAP;
+
+    if (h.atom == MO_128) {
+        tcg_debug_assert(!need_bswap);
+        tcg_debug_assert(datalo & 1);
+        tcg_debug_assert(datahi == datalo - 1);
+        tcg_out32(s, STQ | TAI(datahi, h.index, 0));
+    } else {
+        TCGReg d1, d2;
+
+        if (HOST_BIG_ENDIAN ^ need_bswap) {
+            d1 = datahi, d2 = datalo;
+        } else {
+            d1 = datalo, d2 = datahi;
+        }
+
+        if (need_bswap) {
+            tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_R0, 8);
+            tcg_out32(s, STDBRX | TAB(d1, 0, h.index));
+            tcg_out32(s, STDBRX | TAB(d2, h.index, TCG_REG_R0));
+        } else {
+            tcg_out32(s, STD | TAI(d1, h.index, 0));
+            tcg_out32(s, STD | TAI(d2, h.index, 8));
+        }
+    }
+
+    if (ldst) {
+        ldst->type = TCG_TYPE_I128;
+        ldst->datalo_reg = datalo;
+        ldst->datahi_reg = datahi;
+        ldst->raddr = tcg_splitwx_to_rx(s->code_ptr);
+    }
+}
+
 static void tcg_out_nop_fill(tcg_insn_unit *p, int count)
 {
     int i;
@@ -2849,6 +2968,11 @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc,
                             args[4], TCG_TYPE_I64);
         }
         break;
+    case INDEX_op_qemu_ld_i128:
+        tcg_debug_assert(TCG_TARGET_REG_BITS == 64);
+        tcg_out_qemu_ld128(s, args[0], args[1], args[2], args[3]);
+        break;
+
     case INDEX_op_qemu_st_i32:
         if (TCG_TARGET_REG_BITS >= TARGET_LONG_BITS) {
             tcg_out_qemu_st(s, args[0], -1, args[1], -1,
@@ -2870,6 +2994,10 @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc,
                             args[4], TCG_TYPE_I64);
         }
         break;
+    case INDEX_op_qemu_st_i128:
+        tcg_debug_assert(TCG_TARGET_REG_BITS == 64);
+        tcg_out_qemu_st128(s, args[0], args[1], args[2], args[3]);
+        break;
 
     case INDEX_op_setcond_i32:
         tcg_out_setcond(s, TCG_TYPE_I32, args[3], args[0], args[1], args[2],
@@ -3705,6 +3833,11 @@ static TCGConstraintSetIndex tcg_target_op_def(TCGOpcode op)
                 : TARGET_LONG_BITS == 32 ? C_O0_I3(r, r, r)
                 : C_O0_I4(r, r, r, r));
 
+    case INDEX_op_qemu_ld_i128:
+        return C_O2_I1(o, m, r);
+    case INDEX_op_qemu_st_i128:
+        return C_O0_I3(o, m, r);
+
     case INDEX_op_add_vec:
     case INDEX_op_sub_vec:
     case INDEX_op_mul_vec:
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 132+ messages in thread

* [PATCH v4 57/57] tcg/s390x: Support 128-bit load/store
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
                   ` (55 preceding siblings ...)
  2023-05-03  7:06 ` [PATCH v4 56/57] tcg/ppc: " Richard Henderson
@ 2023-05-03  7:06 ` Richard Henderson
  2023-05-05 13:43 ` [PATCH v4 00/57] tcg: Improve atomicity support Peter Maydell
  57 siblings, 0 replies; 132+ messages in thread
From: Richard Henderson @ 2023-05-03  7:06 UTC (permalink / raw)
  To: qemu-devel; +Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

Use LPQ/STPQ when 16-byte atomicity is required.
Note that these instructions require 16-byte alignment.
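
Put differently, the run-time choice made by the generated code comes
down to something like the sketch below (plain C for illustration, not
code from the patch; only the MemOp names are taken from this series):

    /* Sketch: which path serves a 16-byte guest access on s390x. */
    static bool use_lpq_stpq(uintptr_t host_addr, MemOp required_atom)
    {
        if (required_atom < MO_128) {
            return false;           /* a pair of 8-byte ops is enough */
        }
        /*
         * LPQ/STPQ provide 16-byte single-copy atomicity but insist on
         * 16-byte alignment; an unaligned address falls back to two
         * 8-byte accesses, which the atomicity rules permit there.
         */
        return (host_addr & 15) == 0;
    }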

Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
 tcg/s390x/tcg-target-con-set.h |   2 +
 tcg/s390x/tcg-target.h         |   2 +-
 tcg/s390x/tcg-target.c.inc     | 100 ++++++++++++++++++++++++++++++++-
 3 files changed, 102 insertions(+), 2 deletions(-)

diff --git a/tcg/s390x/tcg-target-con-set.h b/tcg/s390x/tcg-target-con-set.h
index ecc079bb6d..cbad91b2b5 100644
--- a/tcg/s390x/tcg-target-con-set.h
+++ b/tcg/s390x/tcg-target-con-set.h
@@ -14,6 +14,7 @@ C_O0_I2(r, r)
 C_O0_I2(r, ri)
 C_O0_I2(r, rA)
 C_O0_I2(v, r)
+C_O0_I3(o, m, r)
 C_O1_I1(r, r)
 C_O1_I1(v, r)
 C_O1_I1(v, v)
@@ -36,6 +37,7 @@ C_O1_I2(v, v, v)
 C_O1_I3(v, v, v, v)
 C_O1_I4(r, r, ri, rI, r)
 C_O1_I4(r, r, rA, rI, r)
+C_O2_I1(o, m, r)
 C_O2_I2(o, m, 0, r)
 C_O2_I2(o, m, r, r)
 C_O2_I3(o, m, 0, 1, r)
diff --git a/tcg/s390x/tcg-target.h b/tcg/s390x/tcg-target.h
index 170007bea5..ec96952172 100644
--- a/tcg/s390x/tcg-target.h
+++ b/tcg/s390x/tcg-target.h
@@ -140,7 +140,7 @@ extern uint64_t s390_facilities[3];
 #define TCG_TARGET_HAS_muluh_i64      0
 #define TCG_TARGET_HAS_mulsh_i64      0
 
-#define TCG_TARGET_HAS_qemu_ldst_i128 0
+#define TCG_TARGET_HAS_qemu_ldst_i128 1
 
 #define TCG_TARGET_HAS_v64            HAVE_FACILITY(VECTOR)
 #define TCG_TARGET_HAS_v128           HAVE_FACILITY(VECTOR)
diff --git a/tcg/s390x/tcg-target.c.inc b/tcg/s390x/tcg-target.c.inc
index ddd9860a6a..91fecfc51b 100644
--- a/tcg/s390x/tcg-target.c.inc
+++ b/tcg/s390x/tcg-target.c.inc
@@ -243,6 +243,7 @@ typedef enum S390Opcode {
     RXY_LLGF    = 0xe316,
     RXY_LLGH    = 0xe391,
     RXY_LMG     = 0xeb04,
+    RXY_LPQ     = 0xe38f,
     RXY_LRV     = 0xe31e,
     RXY_LRVG    = 0xe30f,
     RXY_LRVH    = 0xe31f,
@@ -253,6 +254,7 @@ typedef enum S390Opcode {
     RXY_STG     = 0xe324,
     RXY_STHY    = 0xe370,
     RXY_STMG    = 0xeb24,
+    RXY_STPQ    = 0xe38e,
     RXY_STRV    = 0xe33e,
     RXY_STRVG   = 0xe32f,
     RXY_STRVH   = 0xe33f,
@@ -1578,7 +1580,19 @@ typedef struct {
 
 bool tcg_target_has_memory_bswap(MemOp memop)
 {
-    return true;
+    MemOp atom_a, atom_u;
+
+    if ((memop & MO_SIZE) <= MO_64) {
+        return true;
+    }
+
+    /*
+     * Reject 16-byte memop with 16-byte atomicity,
+     * but do allow a pair of 64-bit operations.
+     */
+    (void)atom_and_align_for_opc(tcg_ctx, &atom_a, &atom_u, memop,
+                                 MO_ATOM_IFALIGN, true);
+    return atom_a <= MO_64;
 }
 
 static void tcg_out_qemu_ld_direct(TCGContext *s, MemOp opc, TCGReg data,
@@ -1868,6 +1882,80 @@ static void tcg_out_qemu_st(TCGContext* s, TCGReg data_reg, TCGReg addr_reg,
     }
 }
 
+static void tcg_out_qemu_ldst_i128(TCGContext *s, TCGReg datalo, TCGReg datahi,
+                                   TCGReg addr_reg, MemOpIdx oi, bool is_ld)
+{
+    TCGLabel *l1 = NULL, *l2 = NULL;
+    TCGLabelQemuLdst *ldst;
+    HostAddress h;
+    bool need_bswap;
+    bool use_pair;
+    S390Opcode insn;
+
+    ldst = prepare_host_addr(s, &h, addr_reg, oi, is_ld);
+
+    use_pair = h.atom < MO_128;
+    need_bswap = get_memop(oi) & MO_BSWAP;
+
+    if (!use_pair) {
+        /*
+         * Atomicity requires we use LPQ.  If we've already checked for
+         * 16-byte alignment, that's all we need.  If we arrive with
+         * lesser alignment, we have determined that less than 16-byte
+         * alignment can be satisfied with two 8-byte loads.
+         */
+        if (h.align < MO_128) {
+            use_pair = true;
+            l1 = gen_new_label();
+            l2 = gen_new_label();
+
+            tcg_out_insn(s, RI, TMLL, addr_reg, 15);
+            tgen_branch(s, 7, l1); /* CC in {1,2,3} */
+        }
+
+        tcg_debug_assert(!need_bswap);
+        tcg_debug_assert(datalo & 1);
+        tcg_debug_assert(datahi == datalo - 1);
+        insn = is_ld ? RXY_LPQ : RXY_STPQ;
+        tcg_out_insn_RXY(s, insn, datahi, h.base, h.index, h.disp);
+
+        if (use_pair) {
+            tgen_branch(s, S390_CC_ALWAYS, l2);
+            tcg_out_label(s, l1);
+        }
+    }
+    if (use_pair) {
+        TCGReg d1, d2;
+
+        if (need_bswap) {
+            d1 = datalo, d2 = datahi;
+            insn = is_ld ? RXY_LRVG : RXY_STRVG;
+        } else {
+            d1 = datahi, d2 = datalo;
+            insn = is_ld ? RXY_LG : RXY_STG;
+        }
+
+        if (h.base == d1 || h.index == d1) {
+            tcg_out_insn(s, RXY, LAY, TCG_TMP0, h.base, h.index, h.disp);
+            h.base = TCG_TMP0;
+            h.index = TCG_REG_NONE;
+            h.disp = 0;
+        }
+        tcg_out_insn_RXY(s, insn, d1, h.base, h.index, h.disp);
+        tcg_out_insn_RXY(s, insn, d2, h.base, h.index, h.disp + 8);
+    }
+    if (l2) {
+        tcg_out_label(s, l2);
+    }
+
+    if (ldst) {
+        ldst->type = TCG_TYPE_I128;
+        ldst->datalo_reg = datalo;
+        ldst->datahi_reg = datahi;
+        ldst->raddr = tcg_splitwx_to_rx(s->code_ptr);
+    }
+}
+
 static void tcg_out_exit_tb(TCGContext *s, uintptr_t a0)
 {
     /* Reuse the zeroing that exists for goto_ptr.  */
@@ -2225,6 +2313,12 @@ static inline void tcg_out_op(TCGContext *s, TCGOpcode opc,
     case INDEX_op_qemu_st_i64:
         tcg_out_qemu_st(s, args[0], args[1], args[2], TCG_TYPE_I64);
         break;
+    case INDEX_op_qemu_ld_i128:
+        tcg_out_qemu_ldst_i128(s, args[0], args[1], args[2], args[3], true);
+        break;
+    case INDEX_op_qemu_st_i128:
+        tcg_out_qemu_ldst_i128(s, args[0], args[1], args[2], args[3], false);
+        break;
 
     case INDEX_op_ld16s_i64:
         tcg_out_mem(s, 0, RXY_LGH, args[0], args[1], TCG_REG_NONE, args[2]);
@@ -3102,6 +3196,10 @@ static TCGConstraintSetIndex tcg_target_op_def(TCGOpcode op)
     case INDEX_op_qemu_st_i64:
     case INDEX_op_qemu_st_i32:
         return C_O0_I2(r, r);
+    case INDEX_op_qemu_ld_i128:
+        return C_O2_I1(o, m, r);
+    case INDEX_op_qemu_st_i128:
+        return C_O0_I3(o, m, r);
 
     case INDEX_op_deposit_i32:
     case INDEX_op_deposit_i64:
-- 
2.34.1



^ permalink raw reply related	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 01/57] include/exec/memop: Add bits describing atomicity
  2023-05-03  7:06 ` [PATCH v4 01/57] include/exec/memop: Add bits describing atomicity Richard Henderson
@ 2023-05-04 14:49   ` Peter Maydell
  0 siblings, 0 replies; 132+ messages in thread
From: Peter Maydell @ 2023-05-04 14:49 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv,
	qemu-s390x, Alex Bennée

On Wed, 3 May 2023 at 08:11, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> These bits may be used to describe the precise atomicity
> requirements of the guest, which may then be used to
> constrain the methods by which it may be emulated by the host.
>
> For instance, the AArch64 LDP (32-bit) instruction changes
> semantics with ARMv8.4 LSE2, from
>
>   MO_64 | MO_ATMAX_4 | MO_ATOM_IFALIGN
>   (64-bits, single-copy atomic only on 4 byte units,
>    nonatomic if not aligned by 4),
>
> to
>
>   MO_64 | MO_ATMAX_SIZE | MO_ATOM_WITHIN16
>   (64-bits, single-copy atomic within a 16 byte block)
>
> The former may be implemented with two 4 byte loads, or
> a single 8 byte load if that happens to be efficient on
> the host.  The latter may not, and may also require a
> helper when misaligned.
>
> Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>  include/exec/memop.h | 36 ++++++++++++++++++++++++++++++++++++
>  1 file changed, 36 insertions(+)
>
> diff --git a/include/exec/memop.h b/include/exec/memop.h
> index 25d027434a..04e4048f0b 100644
> --- a/include/exec/memop.h
> +++ b/include/exec/memop.h
> @@ -81,6 +81,42 @@ typedef enum MemOp {
>      MO_ALIGN_32 = 5 << MO_ASHIFT,
>      MO_ALIGN_64 = 6 << MO_ASHIFT,
>
> +    /*
> +     * MO_ATOM_* describes that atomicity requirements of the operation:

"the atomicity requirements"

> +     * MO_ATOM_IFALIGN: the operation must be single-copy atomic if and
> +     *    only if it is aligned; if unaligned there is no atomicity.

Is this really "and only if", ie "must *not* be single-copy-atomic if
non aligned"? Plain old "if" seems more likely...

> +     * MO_ATOM_NONE: the operation has no atomicity requirements.
> +     * MO_ATOM_SUBALIGN: the operation is single-copy atomic by parts
> +     *    by the alignment.  E.g. if the address is 0 mod 4, then each
> +     *    4-byte subobject is single-copy atomic.
> +     *    This is the atomicity of IBM Power and S390X processors.
> +     * MO_ATOM_WITHIN16: the operation is single-copy atomic, even if it
> +     *    is unaligned, so long as it does not cross a 16-byte boundary;
> +     *    if it crosses a 16-byte boundary there is no atomicity.
> +     *    This is the atomicity of Arm FEAT_LSE2.
> +     *
> +     * MO_ATMAX_* describes the maximum atomicity unit required:
> +     * MO_ATMAX_SIZE: the entire operation, i.e. MO_SIZE.
> +     * MO_ATMAX_[248]: units of N bytes.
> +     *
> +     * Note the default (i.e. 0) values are single-copy atomic to the
> +     * size of the operation, if aligned.  This retains the behaviour
> +     * from before these were introduced.
> +     */
> +    MO_ATOM_SHIFT    = 8,
> +    MO_ATOM_MASK     = 0x3 << MO_ATOM_SHIFT,
> +    MO_ATOM_IFALIGN  = 0 << MO_ATOM_SHIFT,
> +    MO_ATOM_NONE     = 1 << MO_ATOM_SHIFT,
> +    MO_ATOM_SUBALIGN = 2 << MO_ATOM_SHIFT,
> +    MO_ATOM_WITHIN16 = 3 << MO_ATOM_SHIFT,
> +
> +    MO_ATMAX_SHIFT = 10,
> +    MO_ATMAX_MASK  = 0x3 << MO_ATMAX_SHIFT,
> +    MO_ATMAX_SIZE  = 0 << MO_ATMAX_SHIFT,
> +    MO_ATMAX_2     = 1 << MO_ATMAX_SHIFT,
> +    MO_ATMAX_4     = 2 << MO_ATMAX_SHIFT,
> +    MO_ATMAX_8     = 3 << MO_ATMAX_SHIFT,
> +
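
For reference, the two new fields decode along these lines (a minimal
sketch: the helper names are invented, and only the shift/mask values
come from the patch):

    static inline MemOp memop_atom(MemOp op)
    {
        /* MO_ATOM_IFALIGN, _NONE, _SUBALIGN or _WITHIN16 */
        return op & MO_ATOM_MASK;
    }

    static inline unsigned memop_atmax_bytes(MemOp op)
    {
        MemOp atmax = op & MO_ATMAX_MASK;

        /* MO_ATMAX_SIZE (0) means the whole access is the atomic unit. */
        return atmax == MO_ATMAX_SIZE
               ? memop_size(op)
               : 1u << (atmax >> MO_ATMAX_SHIFT);
    }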

Otherwise
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>

thanks
-- PMM


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 03/57] accel/tcg: Introduce tlb_read_idx
  2023-05-03  7:06 ` [PATCH v4 03/57] accel/tcg: Introduce tlb_read_idx Richard Henderson
@ 2023-05-04 15:02   ` Peter Maydell
  2023-05-05 18:57     ` Richard Henderson
  0 siblings, 1 reply; 132+ messages in thread
From: Peter Maydell @ 2023-05-04 15:02 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv,
	qemu-s390x, Alex Bennée

On Wed, 3 May 2023 at 08:15, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> Instead of playing with offsetof in various places, use
> MMUAccessType to index an array.  This is easily defined
> instead of the previous dummy padding array in the union.
>
> Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---

> @@ -1802,7 +1763,8 @@ static void *atomic_mmu_lookup(CPUArchState *env, target_ulong addr,
>      if (prot & PAGE_WRITE) {
>          tlb_addr = tlb_addr_write(tlbe);
>          if (!tlb_hit(tlb_addr, addr)) {
> -            if (!VICTIM_TLB_HIT(addr_write, addr)) {
> +            if (!victim_tlb_hit(env, mmu_idx, index, MMU_DATA_STORE,
> +                                addr & TARGET_PAGE_MASK)) {
>                  tlb_fill(env_cpu(env), addr, size,
>                           MMU_DATA_STORE, mmu_idx, retaddr);
>                  index = tlb_index(env, mmu_idx, addr);
> @@ -1835,7 +1797,8 @@ static void *atomic_mmu_lookup(CPUArchState *env, target_ulong addr,
>      } else /* if (prot & PAGE_READ) */ {
>          tlb_addr = tlbe->addr_read;
>          if (!tlb_hit(tlb_addr, addr)) {
> -            if (!VICTIM_TLB_HIT(addr_write, addr)) {
> +            if (!victim_tlb_hit(env, mmu_idx, index, MMU_DATA_LOAD,
> +                                addr & TARGET_PAGE_MASK)) {

This was previously looking at addr_write, but now we pass
MMU_DATA_LOAD ?

>                  tlb_fill(env_cpu(env), addr, size,
>                           MMU_DATA_LOAD, mmu_idx, retaddr);
>                  index = tlb_index(env, mmu_idx, addr);

Otherwise
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>

thanks
-- PMM


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 04/57] accel/tcg: Reorg system mode load helpers
  2023-05-03  7:06 ` [PATCH v4 04/57] accel/tcg: Reorg system mode load helpers Richard Henderson
@ 2023-05-04 15:39   ` Peter Maydell
  0 siblings, 0 replies; 132+ messages in thread
From: Peter Maydell @ 2023-05-04 15:39 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv,
	qemu-s390x, Alex Bennée

On Wed, 3 May 2023 at 08:11, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> Instead of trying to unify all operations on uint64_t, pull out
> mmu_lookup() to perform the basic tlb hit and resolution.
> Create individual functions to handle access by size.
>
> Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>

> +/**
> + * mmu_watch_or_dirty
> + * @env: cpu context
> + * @data: lookup parameters
> + * @access_type: load/store/code
> + * @ra: return address into tcg generated code, or 0
> + *
> + * Trigger watchpoints for @data.addr:@data.size;
> + * record writes to protected clean pages.
> + */
> +static void mmu_watch_or_dirty(CPUArchState *env, MMULookupPageData *data,
> +                               MMUAccessType access_type, uintptr_t ra)
> +{
> +    CPUTLBEntryFull *full = data->full;
> +    target_ulong addr = data->addr;
> +    int flags = data->flags;
> +    int size = data->size;
> +
> +    /* On watchpoint hit, this will longjmp out.  */
> +    if (flags & TLB_WATCHPOINT) {
> +        int wp = access_type == MMU_DATA_STORE ? BP_MEM_WRITE : BP_MEM_READ;
> +        cpu_check_watchpoint(env_cpu(env), addr, size, full->attrs, wp, ra);
> +        flags &= ~TLB_WATCHPOINT;
> +    }
> +
> +    if (flags & TLB_NOTDIRTY) {
> +        notdirty_write(env_cpu(env), addr, size, full, ra);
> +        flags &= ~TLB_NOTDIRTY;
> +    }

Maybe worth a comment that the NOTDIRTY flag is only ever
set for the addr_write TLB entry? This confused me for a bit...

> +    data->flags = flags;
> +}
> +
> +/**
> + * mmu_lookup: translate page(s)
> + * @env: cpu context
> + * @addr: virtual address
> + * @oi: combined mmu_idx and MemOp
> + * @ra: return address into tcg generated code, or 0
> + * @access_type: load/store/code
> + * @l: output result
> + *
> + * Resolve the translation for the page(s) beginning at @addr, for MemOp.size
> + * bytes.  Return true if the lookup crosses a page boundary.
> + */
> +static bool mmu_lookup(CPUArchState *env, target_ulong addr, MemOpIdx oi,
> +                       uintptr_t ra, MMUAccessType type, MMULookupLocals *l)
> +{
> +    unsigned a_bits;
> +    bool crosspage;
> +    int flags;
> +
> +    l->memop = get_memop(oi);
> +    l->mmu_idx = get_mmuidx(oi);
> +
> +    tcg_debug_assert(l->mmu_idx < NB_MMU_MODES);
> +
> +    /* Handle CPU specific unaligned behaviour */
> +    a_bits = get_alignment_bits(l->memop);
> +    if (addr & ((1 << a_bits) - 1)) {
> +        cpu_unaligned_access(env_cpu(env), addr, type, l->mmu_idx, ra);
> +    }
> +
> +    l->page[0].addr = addr;
> +    l->page[0].size = memop_size(l->memop);
> +    l->page[1].addr = (addr + l->page[0].size - 1) & TARGET_PAGE_MASK;
> +    l->page[1].size = 0;
> +    crosspage = (addr ^ l->page[1].addr) & TARGET_PAGE_MASK;
> +
> +    if (likely(!crosspage)) {
> +        mmu_lookup1(env, &l->page[0], l->mmu_idx, type, ra);
> +
> +        flags = l->page[0].flags;
> +        if (unlikely(flags & (TLB_WATCHPOINT | TLB_NOTDIRTY))) {
> +            mmu_watch_or_dirty(env, &l->page[0], type, ra);
> +        }
> +        if (unlikely(flags & TLB_BSWAP)) {
> +            l->memop ^= MO_BSWAP;
> +        }
> +    } else {
> +        /* Finish compute of page crossing. */
> +        int size1 = l->page[1].addr - addr;
> +        l->page[1].size = l->page[0].size - size1;
> +        l->page[0].size = size1;

Should this local variable be named "size0" ? It seems to be
the size of the access done to page[0], not the size of the
access to page[1].

> +
> +        /*
> +         * Lookup both pages, recognizing exceptions from either.  If the
> +         * second lookup potentially resized, refresh first CPUTLBEntryFull.
> +         */
> +        mmu_lookup1(env, &l->page[0], l->mmu_idx, type, ra);
> +        if (mmu_lookup1(env, &l->page[1], l->mmu_idx, type, ra)) {
> +            uintptr_t index = tlb_index(env, l->mmu_idx, addr);
> +            l->page[0].full = &env_tlb(env)->d[l->mmu_idx].fulltlb[index];
> +        }
> +
> +        flags = l->page[0].flags | l->page[1].flags;
> +        if (unlikely(flags & (TLB_WATCHPOINT | TLB_NOTDIRTY))) {
> +            mmu_watch_or_dirty(env, &l->page[0], type, ra);
> +            mmu_watch_or_dirty(env, &l->page[1], type, ra);
> +        }
> +
> +        /*
> +         * Since target/sparc is the only user of TLB_BSWAP, and all
> +         * Sparc accesses are aligned, any treatment across two pages
> +         * would be arbitrary.  Refuse it until there's a use.
> +         */
> +        tcg_debug_assert((flags & TLB_BSWAP) == 0);
> +    }
> +
> +    return crosspage;
> +}


> +/**
> + * do_ld_mmio_beN:
> + * @env: cpu context
> + * @p: translation parameters
> + * @ret_be: accumulated data
> + * @mmu_idx: virtual address context
> + * @ra: return address into tcg generated code, or 0
> + *
> + * Load @p->size bytes from @p->addr, which is memory-mapped i/o.
> + * The bytes are concatenated with in big-endian order with @ret_be.

"in big-endian order with"

> + */
> +static uint64_t do_ld_mmio_beN(CPUArchState *env, MMULookupPageData *p,
> +                               uint64_t ret_be, int mmu_idx,
> +                               MMUAccessType type, uintptr_t ra)
>  {
> -    validate_memop(oi, MO_UB);
> -    return load_helper(env, addr, oi, retaddr, MO_UB, MMU_DATA_LOAD,
> -                       full_ldub_mmu);
> +    CPUTLBEntryFull *full = p->full;
> +    target_ulong addr = p->addr;
> +    int i, size = p->size;
> +
> +    QEMU_IOTHREAD_LOCK_GUARD();
> +    for (i = 0; i < size; i++) {
> +        uint8_t x = io_readx(env, full, mmu_idx, addr + i, ra, type, MO_UB);
> +        ret_be = (ret_be << 8) | x;
> +    }
> +    return ret_be;
> +}

Otherwise
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>

thanks
-- PMM


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 05/57] accel/tcg: Reorg system mode store helpers
  2023-05-03  7:06 ` [PATCH v4 05/57] accel/tcg: Reorg system mode store helpers Richard Henderson
@ 2023-05-04 15:44   ` Peter Maydell
  0 siblings, 0 replies; 132+ messages in thread
From: Peter Maydell @ 2023-05-04 15:44 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On Wed, 3 May 2023 at 08:23, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> Instead of trying to unify all operations on uint64_t, use
> mmu_lookup() to perform the basic tlb hit and resolution.
> Create individual functions to handle access by size.
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>

thanks
-- PMM


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 06/57] accel/tcg: Honor atomicity of loads
  2023-05-03  7:06 ` [PATCH v4 06/57] accel/tcg: Honor atomicity of loads Richard Henderson
@ 2023-05-04 17:17   ` Peter Maydell
  2023-05-05 20:19     ` Richard Henderson
  0 siblings, 1 reply; 132+ messages in thread
From: Peter Maydell @ 2023-05-04 17:17 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv,
	qemu-s390x, Alex Bennée

On Wed, 3 May 2023 at 08:08, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> Create ldst_atomicity.c.inc.
>
> Not required for user-only code loads, because we've ensured that
> the page is read-only before beginning to translate code.
>
> Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---



> +/**
> + * required_atomicity:
> + *
> + * Return the lg2 bytes of atomicity required by @memop for @p.
> + * If the operation must be split into two operations to be
> + * examined separately for atomicity, return -lg2.
> + */
> +static int required_atomicity(CPUArchState *env, uintptr_t p, MemOp memop)
> +{
> +    int atmax = memop & MO_ATMAX_MASK;
> +    int size = memop & MO_SIZE;
> +    unsigned tmp;
> +
> +    if (atmax == MO_ATMAX_SIZE) {
> +        atmax = size;
> +    } else {
> +        atmax >>= MO_ATMAX_SHIFT;
> +    }
> +
> +    switch (memop & MO_ATOM_MASK) {
> +    case MO_ATOM_IFALIGN:
> +        tmp = (1 << atmax) - 1;
> +        if (p & tmp) {
> +            return MO_8;
> +        }
> +        break;
> +    case MO_ATOM_NONE:
> +        return MO_8;
> +    case MO_ATOM_SUBALIGN:
> +        tmp = p & -p;
> +        if (tmp != 0 && tmp < atmax) {
> +            atmax = tmp;
> +        }
> +        break;

I don't understand the bit manipulation going on here.
AIUI what we're trying to do is say "if e.g. p is only
2-aligned then we only get 2-alignment". But, suppose
p == 0x1002. Then (p & -p) is 0x2. But that's MO_32,
not MO_16. Am I missing something ?

(Also, it would be nice to have a comment mentioning
what (p & -p) does, so readers don't have to try to
search for a not very-searchable expression to find out.)
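
For what it's worth, a standalone check of the arithmetic in question
(sketch only, nothing from the patch):

    #include <assert.h>
    #include <stdint.h>

    int main(void)
    {
        uintptr_t p = 0x1002;

        /* p & -p isolates the lowest set bit as a byte count... */
        assert((p & -p) == 0x2);
        /* ...while the MO_* values compared against are log2 sizes;
         * the matching log2 would come from counting trailing zeros. */
        assert(__builtin_ctzll(p) == 1);    /* 1 == MO_16 */
        return 0;
    }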

> +    case MO_ATOM_WITHIN16:
> +        tmp = p & 15;
> +        if (tmp + (1 << size) <= 16) {
> +            atmax = size;

OK, so this is "whole operation is within 16 bytes,
whole operation must be atomic"...

> +        } else if (atmax == size) {
> +            return MO_8;

...but I don't understand the interaction of WITHIN16
and also specifying an ATMAX value that's not ATMAX_SIZE.

> +        } else if (tmp + (1 << atmax) != 16) {

Why is this doing an exact inequality check?
What if you're asking for a load of 8 bytes at
MO_ATMAX_2 from a pointer that's at an offset of
10 bytes from a 16-byte boundary? Then tmp is 10,
tmp + (1 << atmax) is 12, but we could still do the
loads at atomicity 2. This doesn't seem to me to be
any different from the case it does catch where
the first ATMAX_2-sized unit happens to be the only
thing in this 16-byte block.

The doc comment in patch 1 could probably be beefed
up to better explain the interaction of WITHIN16
and ATMAX_*.

> +            /*
> +             * Paired load/store, where the pairs aren't aligned.
> +             * One of the two must still be handled atomically.
> +             */
> +            atmax = -atmax;
> +        }
> +        break;
> +    default:
> +        g_assert_not_reached();
> +    }
> +
> +    /*
> +     * Here we have the architectural atomicity of the operation.
> +     * However, when executing in a serial context, we need no extra
> +     * host atomicity in order to avoid racing.  This reduction
> +     * avoids looping with cpu_loop_exit_atomic.
> +     */
> +    if (cpu_in_serial_context(env_cpu(env))) {
> +        return MO_8;
> +    }
> +    return atmax;
> +}
> +

> +/**
> + * load_atomic8_or_exit:
> + * @env: cpu context
> + * @ra: host unwind address
> + * @pv: host address
> + *
> + * Atomically load 8 aligned bytes from @pv.
> + * If this is not possible, longjmp out to restart serially.
> + */
> +static uint64_t load_atomic8_or_exit(CPUArchState *env, uintptr_t ra, void *pv)
> +{
> +    if (HAVE_al8) {
> +        return load_atomic8(pv);
> +    }
> +
> +#ifdef CONFIG_USER_ONLY
> +    /*
> +     * If the page is not writable, then assume the value is immutable
> +     * and requires no locking.  This ignores the case of MAP_SHARED with
> +     * another process, because the fallback start_exclusive solution
> +     * provides no protection across processes.
> +     */
> +    if (!page_check_range(h2g(pv), 8, PAGE_WRITE)) {
> +        uint64_t *p = __builtin_assume_aligned(pv, 8);
> +        return *p;
> +    }

This will also do a non-atomic read for the case where
the guest has mapped the same memory twice at different
addresses, once read-only and once writeable, I think.
In theory in that situation we could use start_exclusive.
But maybe that's a weird corner case we can ignore?

> +#endif
> +
> +    /* Ultimate fallback: re-execute in serial context. */
> +    cpu_loop_exit_atomic(env_cpu(env), ra);
> +}
> +

thanks
-- PMM


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 07/57] accel/tcg: Honor atomicity of stores
  2023-05-03  7:06 ` [PATCH v4 07/57] accel/tcg: Honor atomicity of stores Richard Henderson
@ 2023-05-05  9:28   ` Peter Maydell
  2023-05-08 10:11     ` Richard Henderson
  0 siblings, 1 reply; 132+ messages in thread
From: Peter Maydell @ 2023-05-05  9:28 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On Wed, 3 May 2023 at 08:11, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>  accel/tcg/cputlb.c             | 103 +++----
>  accel/tcg/user-exec.c          |  12 +-
>  accel/tcg/ldst_atomicity.c.inc | 491 +++++++++++++++++++++++++++++++++
>  3 files changed, 540 insertions(+), 66 deletions(-)


> +/**
> + * store_atom_insert_al16:
> + * @p: host address
> + * @val: shifted value to store
> + * @msk: mask for value to store
> + *
> + * Atomically store @val to @p masked by @msk.
> + */
> +static void store_atom_insert_al16(Int128 *ps, Int128Alias val, Int128Alias msk)
> +{
> +#if defined(CONFIG_ATOMIC128)
> +    __uint128_t *pu, old, new;
> +
> +    /* With CONFIG_ATOMIC128, we can avoid the memory barriers. */
> +    pu = __builtin_assume_aligned(ps, 16);
> +    old = *pu;
> +    do {
> +        new = (old & ~msk.u) | val.u;
> +    } while (!__atomic_compare_exchange_n(pu, &old, new, true,
> +                                          __ATOMIC_RELAXED, __ATOMIC_RELAXED));
> +#elif defined(CONFIG_CMPXCHG128)
> +    __uint128_t *pu, old, new;
> +
> +    /*
> +     * Without CONFIG_ATOMIC128, __atomic_compare_exchange_n will always
> +     * defer to libatomic, so we must use __sync_val_compare_and_swap_16
> +     * and accept the sequential consistency that comes with it.
> +     */
> +    pu = __builtin_assume_aligned(ps, 16);
> +    do {
> +        old = *pu;
> +        new = (old & ~msk.u) | val.u;
> +    } while (!__sync_bool_compare_and_swap_16(pu, old, new));

Comment says "__sync_val..." but code says "__sync_bool...". Which is right?
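
For reference, the two builtins differ only in what they return (both
are full-barrier compare-and-swap); a val-returning variant of the same
masked-insert loop could look like the sketch below (self-contained,
with an invented function name):

    /*
     *   bool __sync_bool_compare_and_swap(type *ptr, type oldval, type newval);
     *   type __sync_val_compare_and_swap(type *ptr, type oldval, type newval);
     */
    static void masked_insert_al16(__uint128_t *pu, __uint128_t val,
                                   __uint128_t msk)
    {
        __uint128_t old = *pu;

        for (;;) {
            __uint128_t new = (old & ~msk) | val;
            __uint128_t cur = __sync_val_compare_and_swap_16(pu, old, new);

            if (cur == old) {
                return;         /* exchange succeeded */
            }
            old = cur;          /* raced; retry with the observed value */
        }
    }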


> +#else
> +    qemu_build_not_reached();
> +#endif
> +}

-- PMM


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 08/57] target/loongarch: Do not include tcg-ldst.h
  2023-05-03  7:06 ` [PATCH v4 08/57] target/loongarch: Do not include tcg-ldst.h Richard Henderson
@ 2023-05-05  9:29   ` Peter Maydell
  0 siblings, 0 replies; 132+ messages in thread
From: Peter Maydell @ 2023-05-05  9:29 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On Wed, 3 May 2023 at 08:11, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> This header is supposed to be private to tcg and in fact
> does not need to be included here at all.
>
> Reviewed-by: Song Gao <gaosong@loongson.cn>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>

thanks
-- PMM


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 09/57] tcg: Unify helper_{be,le}_{ld,st}*
  2023-05-03  7:06 ` [PATCH v4 09/57] tcg: Unify helper_{be,le}_{ld,st}* Richard Henderson
@ 2023-05-05  9:36   ` Peter Maydell
  0 siblings, 0 replies; 132+ messages in thread
From: Peter Maydell @ 2023-05-05  9:36 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On Wed, 3 May 2023 at 08:15, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> With the current structure of cputlb.c, there is no difference
> between the little-endian and big-endian entry points, aside
> from the assert.  Unify the pairs of functions.
>
> Hoist the qemu_{ld,st}_helpers arrays to tcg.c.
>
> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>

thanks
-- PMM


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 10/57] accel/tcg: Implement helper_{ld, st}*_mmu for user-only
  2023-05-03  7:06 ` [PATCH v4 10/57] accel/tcg: Implement helper_{ld, st}*_mmu for user-only Richard Henderson
@ 2023-05-05  9:43   ` Peter Maydell
  0 siblings, 0 replies; 132+ messages in thread
From: Peter Maydell @ 2023-05-05  9:43 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On Wed, 3 May 2023 at 08:23, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> TCG backends may need to defer to a helper to implement
> the atomicity required by a given operation.  Mirror the
> interface used in system mode.
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>  include/tcg/tcg-ldst.h |   6 +-
>  accel/tcg/user-exec.c  | 393 ++++++++++++++++++++++++++++-------------
>  tcg/tcg.c              |   6 +-
>  3 files changed, 278 insertions(+), 127 deletions(-)
>

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>

thanks
-- PMM


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 11/57] tcg/tci: Use helper_{ld,st}*_mmu for user-only
  2023-05-03  7:06 ` [PATCH v4 11/57] tcg/tci: Use helper_{ld,st}*_mmu " Richard Henderson
@ 2023-05-05  9:44   ` Peter Maydell
  0 siblings, 0 replies; 132+ messages in thread
From: Peter Maydell @ 2023-05-05  9:44 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On Wed, 3 May 2023 at 08:08, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> We can now fold these two pieces of code.
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>

thanks
-- PMM


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 12/57] tcg: Add 128-bit guest memory primitives
  2023-05-03  7:06 ` [PATCH v4 12/57] tcg: Add 128-bit guest memory primitives Richard Henderson
@ 2023-05-05 10:04   ` Peter Maydell
  0 siblings, 0 replies; 132+ messages in thread
From: Peter Maydell @ 2023-05-05 10:04 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On Wed, 3 May 2023 at 08:17, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---



> +/**
> + * load_atom_16:
> + * @p: host address
> + * @memop: the full memory op
> + *
> + * Load 16 bytes from @p, honoring the atomicity of @memop.
> + */
> +static Int128 load_atom_16(CPUArchState *env, uintptr_t ra,
> +                           void *pv, MemOp memop)
> +{
> +    uintptr_t pi = (uintptr_t)pv;
> +    int atmax;
> +    Int128 r;
> +    uint64_t a, b;
> +
> +    /*
> +     * If the host does not support 8-byte atomics, wait until we have
> +     * examined the atomicity parameters below.
> +     */
> +    if (HAVE_al16_fast && likely((pi & 15) == 0)) {
> +        return load_atomic16(pv);
> +    }

Comment says "8-byte atomics" but code is testing for
existence of 16-byte atomics ?

> +
> +    atmax = required_atomicity(env, pi, memop);
> +    switch (atmax) {
> +    case MO_8:
> +        memcpy(&r, pv, 16);
> +        return r;
> +    case MO_16:
> +        a = load_atom_8_by_2(pv);
> +        b = load_atom_8_by_2(pv + 8);
> +        break;
> +    case MO_32:
> +        a = load_atom_8_by_4(pv);
> +        b = load_atom_8_by_4(pv + 8);
> +        break;
> +    case MO_64:
> +        if (!HAVE_al8) {
> +            cpu_loop_exit_atomic(env_cpu(env), ra);
> +        }
> +        a = load_atomic8(pv);
> +        b = load_atomic8(pv + 8);
> +        break;
> +    case -MO_64:
> +        if (!HAVE_al8) {
> +            cpu_loop_exit_atomic(env_cpu(env), ra);
> +        }
> +        a = load_atom_extract_al8x2(pv);
> +        b = load_atom_extract_al8x2(pv + 8);
> +        break;
> +    case MO_128:
> +        return load_atomic16_or_exit(env, ra, pv);
> +    default:
> +        g_assert_not_reached();
> +    }
> +    return int128_make128(HOST_BIG_ENDIAN ? b : a, HOST_BIG_ENDIAN ? a : b);
> + }

> +/**
> + * store_atomic16:
> + * @pv: host address
> + * @val: value to store
> + *
> + * Atomically store 16 aligned bytes to @pv.
> + */
> +static inline void store_atomic16(void *pv, Int128 val)
> +{
> +#if defined(CONFIG_ATOMIC128)
> +    __uint128_t *pu = __builtin_assume_aligned(pv, 16);
> +    Int128Alias new;
> +
> +    new.s = val;
> +    qatomic_set__nocheck(pu, new.u);
> +#elif defined(CONFIG_CMPXCHG128)
> +    __uint128_t *pu = __builtin_assume_aligned(pv, 16);
> +    __uint128_t o;
> +    Int128Alias n;
> +
> +    /*
> +     * Without CONFIG_ATOMIC128, __atomic_compare_exchange_n will always
> +     * defer to libatomic, so we must use __sync_val_compare_and_swap_16
> +     * and accept the sequential consistency that comes with it.
> +     */
> +    n.s = val;
> +    do {
> +        o = *pu;
> +    } while (!__sync_bool_compare_and_swap_16(pu, o, n.u));

Same val vs bool thing as the other patch.


Otherwise
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>

thanks
-- PMM


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 13/57] meson: Detect atomic128 support with optimization
  2023-05-03  7:06 ` [PATCH v4 13/57] meson: Detect atomic128 support with optimization Richard Henderson
@ 2023-05-05 10:29   ` Peter Maydell
  0 siblings, 0 replies; 132+ messages in thread
From: Peter Maydell @ 2023-05-05 10:29 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On Wed, 3 May 2023 at 08:29, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> There is an edge condition prior to gcc13 for which optimization
> is required to generate 16-byte atomic sequences.  Detect this.
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>  accel/tcg/ldst_atomicity.c.inc | 38 ++++++++++++++++++-------
>  meson.build                    | 52 ++++++++++++++++++++++------------
>  2 files changed, 61 insertions(+), 29 deletions(-)
>



> @@ -676,28 +695,24 @@ static inline void store_atomic8(void *pv, uint64_t val)
>   *
>   * Atomically store 16 aligned bytes to @pv.
>   */
> -static inline void store_atomic16(void *pv, Int128 val)
> +static inline void ATTRIBUTE_ATOMIC128_OPT
> +store_atomic16(void *pv, Int128Alias val)
>  {
>  #if defined(CONFIG_ATOMIC128)
>      __uint128_t *pu = __builtin_assume_aligned(pv, 16);
> -    Int128Alias new;
> -
> -    new.s = val;
> -    qatomic_set__nocheck(pu, new.u);
> +    qatomic_set__nocheck(pu, val.u);
>  #elif defined(CONFIG_CMPXCHG128)
>      __uint128_t *pu = __builtin_assume_aligned(pv, 16);
>      __uint128_t o;
> -    Int128Alias n;
>
>      /*
>       * Without CONFIG_ATOMIC128, __atomic_compare_exchange_n will always
>       * defer to libatomic, so we must use __sync_val_compare_and_swap_16
>       * and accept the sequential consistency that comes with it.
>       */
> -    n.s = val;
>      do {
>          o = *pu;
> -    } while (!__sync_bool_compare_and_swap_16(pu, o, n.u));
> +    } while (!__sync_bool_compare_and_swap_16(pu, o, val.u));
>  #else
>      qemu_build_not_reached();
>  #endif

Should this change be in a different patch? It doesn't
seem related to the meson detection.



Otherwise
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>

thanks
-- PMM


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 14/57] tcg/i386: Add have_atomic16
  2023-05-03  7:06 ` [PATCH v4 14/57] tcg/i386: Add have_atomic16 Richard Henderson
@ 2023-05-05 10:34   ` Peter Maydell
  2023-05-08 13:41     ` Richard Henderson
  0 siblings, 1 reply; 132+ messages in thread
From: Peter Maydell @ 2023-05-05 10:34 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On Wed, 3 May 2023 at 08:10, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> Notice when Intel or AMD have guaranteed that vmovdqa is atomic.
> The new variable will also be used in generated code.
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>  include/qemu/cpuid.h      | 18 ++++++++++++++++++
>  tcg/i386/tcg-target.h     |  1 +
>  tcg/i386/tcg-target.c.inc | 27 +++++++++++++++++++++++++++
>  3 files changed, 46 insertions(+)
>
> diff --git a/include/qemu/cpuid.h b/include/qemu/cpuid.h
> index 1451e8ef2f..35325f1995 100644
> --- a/include/qemu/cpuid.h
> +++ b/include/qemu/cpuid.h
> @@ -71,6 +71,24 @@
>  #define bit_LZCNT       (1 << 5)
>  #endif
>
> +/*
> + * Signatures for different CPU implementations as returned from Leaf 0.
> + */
> +
> +#ifndef signature_INTEL_ecx
> +/* "Genu" "ineI" "ntel" */
> +#define signature_INTEL_ebx     0x756e6547
> +#define signature_INTEL_edx     0x49656e69
> +#define signature_INTEL_ecx     0x6c65746e
> +#endif
> +
> +#ifndef signature_AMD_ecx
> +/* "Auth" "enti" "cAMD" */
> +#define signature_AMD_ebx       0x68747541
> +#define signature_AMD_edx       0x69746e65
> +#define signature_AMD_ecx       0x444d4163
> +#endif

> @@ -4024,6 +4025,32 @@ static void tcg_target_init(TCGContext *s)
>                      have_avx512dq = (b7 & bit_AVX512DQ) != 0;
>                      have_avx512vbmi2 = (c7 & bit_AVX512VBMI2) != 0;
>                  }
> +
> +                /*
> +                 * The Intel SDM has added:
> +                 *   Processors that enumerate support for Intel® AVX
> +                 *   (by setting the feature flag CPUID.01H:ECX.AVX[bit 28])
> +                 *   guarantee that the 16-byte memory operations performed
> +                 *   by the following instructions will always be carried
> +                 *   out atomically:
> +                 *   - MOVAPD, MOVAPS, and MOVDQA.
> +                 *   - VMOVAPD, VMOVAPS, and VMOVDQA when encoded with VEX.128.
> +                 *   - VMOVAPD, VMOVAPS, VMOVDQA32, and VMOVDQA64 when encoded
> +                 *     with EVEX.128 and k0 (masking disabled).
> +                 * Note that these instructions require the linear addresses
> +                 * of their memory operands to be 16-byte aligned.
> +                 *
> +                 * AMD has provided an even stronger guarantee that processors
> +                 * with AVX provide 16-byte atomicity for all cachable,
> +                 * naturally aligned single loads and stores, e.g. MOVDQU.
> +                 *
> +                 * See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688
> +                 */
> +                if (have_avx1) {
> +                    __cpuid(0, a, b, c, d);
> +                    have_atomic16 = (c == signature_INTEL_ecx ||
> +                                     c == signature_AMD_ecx);
> +                }

If the signature is 3 words why are we only checking one here ?
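
(For comparison, a check of all three words could look like the sketch
below; illustrative only, reusing the constants and the __cpuid() call
quoted above.)

    if (have_avx1) {
        unsigned a, b, c, d;

        __cpuid(0, a, b, c, d);
        have_atomic16 = ((b == signature_INTEL_ebx &&
                          c == signature_INTEL_ecx &&
                          d == signature_INTEL_edx) ||
                         (b == signature_AMD_ebx &&
                          c == signature_AMD_ecx &&
                          d == signature_AMD_edx));
    }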

thanks
-- PMM


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 38/57] tcg/riscv: Support softmmu unaligned accesses
  2023-05-03  7:06 ` [PATCH v4 38/57] tcg/riscv: " Richard Henderson
@ 2023-05-05 10:35   ` LIU Zhiwei
  0 siblings, 0 replies; 132+ messages in thread
From: LIU Zhiwei @ 2023-05-05 10:35 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel
  Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x


On 2023/5/3 15:06, Richard Henderson wrote:
> The system is required to emulate unaligned accesses, even if the
> hardware does not support it.  The resulting trap may or may not
> be more efficient than the qemu slow path.  There are linux kernel
> patches in flight to allow userspace to query hardware support;
> we can re-evaluate whether to enable this by default after that.
>
> In the meantime, softmmu now matches useronly, where we already
> assumed that unaligned accesses are supported.
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>   tcg/riscv/tcg-target.c.inc | 48 ++++++++++++++++++++++----------------
>   1 file changed, 28 insertions(+), 20 deletions(-)
>
> diff --git a/tcg/riscv/tcg-target.c.inc b/tcg/riscv/tcg-target.c.inc
> index 19cd4507fb..415e6c6e15 100644
> --- a/tcg/riscv/tcg-target.c.inc
> +++ b/tcg/riscv/tcg-target.c.inc
> @@ -910,12 +910,13 @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, TCGReg *pbase,
>   
>   #ifdef CONFIG_SOFTMMU
>       unsigned s_bits = opc & MO_SIZE;
> +    unsigned s_mask = (1u << s_bits) - 1;
>       int mem_index = get_mmuidx(oi);
>       int fast_ofs = TLB_MASK_TABLE_OFS(mem_index);
>       int mask_ofs = fast_ofs + offsetof(CPUTLBDescFast, mask);
>       int table_ofs = fast_ofs + offsetof(CPUTLBDescFast, table);
> -    TCGReg mask_base = TCG_AREG0, table_base = TCG_AREG0;
> -    tcg_target_long compare_mask;
> +    int compare_mask;
> +    TCGReg addr_adj;
>   
>       ldst = new_ldst_label(s);
>       ldst->is_ld = is_ld;
> @@ -924,14 +925,33 @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, TCGReg *pbase,
>   
>       QEMU_BUILD_BUG_ON(TLB_MASK_TABLE_OFS(0) > 0);
>       QEMU_BUILD_BUG_ON(TLB_MASK_TABLE_OFS(0) < -(1 << 11));
> -    tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_TMP0, mask_base, mask_ofs);
> -    tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_TMP1, table_base, table_ofs);
> +    tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_TMP0, TCG_AREG0, mask_ofs);
> +    tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_TMP1, TCG_AREG0, table_ofs);
>   
>       tcg_out_opc_imm(s, OPC_SRLI, TCG_REG_TMP2, addr_reg,
>                       TARGET_PAGE_BITS - CPU_TLB_ENTRY_BITS);
>       tcg_out_opc_reg(s, OPC_AND, TCG_REG_TMP2, TCG_REG_TMP2, TCG_REG_TMP0);
>       tcg_out_opc_reg(s, OPC_ADD, TCG_REG_TMP2, TCG_REG_TMP2, TCG_REG_TMP1);
>   
> +    /*
> +     * For aligned accesses, we check the first byte and include the alignment
> +     * bits within the address.  For unaligned access, we check that we don't
> +     * cross pages using the address of the last byte of the access.
> +     */
> +    addr_adj = addr_reg;
> +    if (a_bits < s_bits) {
> +        addr_adj = TCG_REG_TMP0;
> +        tcg_out_opc_imm(s, TARGET_LONG_BITS == 32 ? OPC_ADDIW : OPC_ADDI,
> +                        addr_adj, addr_reg, s_mask - a_mask);
> +    }
> +    compare_mask = TARGET_PAGE_MASK | a_mask;
> +    if (compare_mask == sextreg(compare_mask, 0, 12)) {
> +        tcg_out_opc_imm(s, OPC_ANDI, TCG_REG_TMP1, addr_adj, compare_mask);
> +    } else {
> +        tcg_out_movi(s, TCG_TYPE_TL, TCG_REG_TMP1, compare_mask);
> +        tcg_out_opc_reg(s, OPC_AND, TCG_REG_TMP1, TCG_REG_TMP1, addr_adj);
> +    }
> +
>       /* Load the tlb comparator and the addend.  */
>       tcg_out_ld(s, TCG_TYPE_TL, TCG_REG_TMP0, TCG_REG_TMP2,
>                  is_ld ? offsetof(CPUTLBEntry, addr_read)
> @@ -939,29 +959,17 @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, TCGReg *pbase,
>       tcg_out_ld(s, TCG_TYPE_PTR, TCG_REG_TMP2, TCG_REG_TMP2,
>                  offsetof(CPUTLBEntry, addend));
>   
> -    /* We don't support unaligned accesses. */
> -    if (a_bits < s_bits) {
> -        a_bits = s_bits;
> -    }
> -    /* Clear the non-page, non-alignment bits from the address.  */
> -    compare_mask = (tcg_target_long)TARGET_PAGE_MASK | a_mask;
> -    if (compare_mask == sextreg(compare_mask, 0, 12)) {
> -        tcg_out_opc_imm(s, OPC_ANDI, TCG_REG_TMP1, addr_reg, compare_mask);
> -    } else {
> -        tcg_out_movi(s, TCG_TYPE_TL, TCG_REG_TMP1, compare_mask);
> -        tcg_out_opc_reg(s, OPC_AND, TCG_REG_TMP1, TCG_REG_TMP1, addr_reg);
> -    }
> -
>       /* Compare masked address with the TLB entry. */
>       ldst->label_ptr[0] = s->code_ptr;
>       tcg_out_opc_branch(s, OPC_BNE, TCG_REG_TMP0, TCG_REG_TMP1, 0);
>   
>       /* TLB Hit - translate address using addend.  */
> +    addr_adj = addr_reg;
>       if (TARGET_LONG_BITS == 32) {
> -        tcg_out_ext32u(s, TCG_REG_TMP0, addr_reg);
> -        addr_reg = TCG_REG_TMP0;
> +        addr_adj = TCG_REG_TMP0;
> +        tcg_out_ext32u(s, addr_adj, addr_reg);
>       }
> -    tcg_out_opc_reg(s, OPC_ADD, TCG_REG_TMP0, TCG_REG_TMP2, addr_reg);
> +    tcg_out_opc_reg(s, OPC_ADD, TCG_REG_TMP0, TCG_REG_TMP2, addr_adj);

Reviewed-by: LIU Zhiwei <zhiwei_liu@linux.alibaba.com>

Zhiwei

>       *pbase = TCG_REG_TMP0;
>   #else
>       if (a_mask) {


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 15/57] accel/tcg: Use have_atomic16 in ldst_atomicity.c.inc
  2023-05-03  7:06 ` [PATCH v4 15/57] accel/tcg: Use have_atomic16 in ldst_atomicity.c.inc Richard Henderson
@ 2023-05-05 10:37   ` Peter Maydell
  2023-05-08 13:48     ` Richard Henderson
  0 siblings, 1 reply; 132+ messages in thread
From: Peter Maydell @ 2023-05-05 10:37 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On Wed, 3 May 2023 at 08:08, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> Hosts using Intel and AMD AVX cpus are quite common.
> Add fast paths through ldst_atomicity using this.
>
> Only enable with CONFIG_INT128; some older clang versions do not
> support __int128_t, and the inline assembly won't work on structures.
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>  accel/tcg/ldst_atomicity.c.inc | 76 +++++++++++++++++++++++++++-------
>  1 file changed, 60 insertions(+), 16 deletions(-)

Inline x86 asm in a bit of generic code seems rather awkward.
Ideally the compiler should be doing this for us; failing
that can we at least abstract out the operations to a
set of functions that we can provide (or not provide)
implementations of?

-- PMM


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 17/57] tcg/aarch64: Detect have_lse, have_lse2 for linux
  2023-05-03  7:06 ` [PATCH v4 17/57] tcg/aarch64: Detect have_lse, have_lse2 for linux Richard Henderson
@ 2023-05-05 10:41   ` Peter Maydell
  0 siblings, 0 replies; 132+ messages in thread
From: Peter Maydell @ 2023-05-05 10:41 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On Wed, 3 May 2023 at 08:08, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> Notice when the host has additional atomic instructions.
> The new variables will also be used in generated code.
>
> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>

thanks
-- PMM


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 18/57] tcg/aarch64: Detect have_lse, have_lse2 for darwin
  2023-05-03  7:06 ` [PATCH v4 18/57] tcg/aarch64: Detect have_lse, have_lse2 for darwin Richard Henderson
@ 2023-05-05 10:43   ` Peter Maydell
  0 siblings, 0 replies; 132+ messages in thread
From: Peter Maydell @ 2023-05-05 10:43 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On Wed, 3 May 2023 at 08:19, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> These features are present for Apple M1.
>
> Tested-by: Philippe Mathieu-Daudé <philmd@linaro.org>
> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> +#ifdef CONFIG_DARWIN
> +static bool sysctl_for_bool(const char *name)
> +{
> +    int val = 0;
> +    size_t len = sizeof(val);
> +
> +    if (sysctlbyname(name, &val, &len, NULL, 0) == 0) {
> +        return val != 0;
> +    }
> +
> +    /*
> +     * We might in ask for properties not present in older kernels,

"might in ask" is a typo for something, but I'm not sure what.

> +     * but we're only asking about static properties, all of which
> +     * should be 'int'.  So we shouln't see ENOMEM (val too small),
> +     * or any of the other more exotic errors.
> +     */
> +    assert(errno == ENOENT);
> +    return false;
> +}
> +#endif
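(Presumably the callers then look something like the lines below; the
hw.optional.arm.FEAT_* keys are Apple's documented sysctl names,
assumed here rather than taken from the quoted hunk:)

    have_lse  = sysctl_for_bool("hw.optional.arm.FEAT_LSE");
    have_lse2 = sysctl_for_bool("hw.optional.arm.FEAT_LSE2");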

Otherwise
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>

thanks
-- PMM


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 20/57] tcg: Introduce TCG_OPF_TYPE_MASK
  2023-05-03  7:06 ` [PATCH v4 20/57] tcg: Introduce TCG_OPF_TYPE_MASK Richard Henderson
@ 2023-05-05 10:45   ` Peter Maydell
  0 siblings, 0 replies; 132+ messages in thread
From: Peter Maydell @ 2023-05-05 10:45 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On Wed, 3 May 2023 at 08:09, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> Reorg TCG_OPF_64BIT and TCG_OPF_VECTOR into a two-bit field so
> that we can add TCG_OPF_128BIT without requiring another bit.
>
> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
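(Roughly, as an illustration only -- the actual field position and the
value encoding are whatever the patch defines:)

    /* Hypothetical packing: a two-bit type field replacing the separate
     * TCG_OPF_64BIT / TCG_OPF_VECTOR bits; shift value is illustrative. */
    #define TCG_OPF_TYPE_SHIFT  4
    #define TCG_OPF_TYPE_MASK   (3 << TCG_OPF_TYPE_SHIFT)
    /* four encodings: 32-bit, 64-bit, vector, 128-bit */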

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>

thanks
-- PMM


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 21/57] tcg/i386: Use full load/store helpers in user-only mode
  2023-05-03  7:06 ` [PATCH v4 21/57] tcg/i386: Use full load/store helpers in user-only mode Richard Henderson
@ 2023-05-05 12:01   ` Peter Maydell
  0 siblings, 0 replies; 132+ messages in thread
From: Peter Maydell @ 2023-05-05 12:01 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On Wed, 3 May 2023 at 08:09, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> Instead of using helper_unaligned_{ld,st}, use the full load/store helpers.
> This will allow the fast path to increase alignment to implement atomicity
> while not immediately raising an alignment exception.
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>

thanks
-- PMM


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 22/57] tcg/aarch64: Use full load/store helpers in user-only mode
  2023-05-03  7:06 ` [PATCH v4 22/57] tcg/aarch64: " Richard Henderson
@ 2023-05-05 12:06   ` Peter Maydell
  0 siblings, 0 replies; 132+ messages in thread
From: Peter Maydell @ 2023-05-05 12:06 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On Wed, 3 May 2023 at 08:12, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> Instead of using helper_unaligned_{ld,st}, use the full load/store helpers.
> This will allow the fast path to increase alignment to implement atomicity
> while not immediately raising an alignment exception.
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> --

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>

thanks
-- PMM


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 23/57] tcg/ppc: Use full load/store helpers in user-only mode
  2023-05-03  7:06 ` [PATCH v4 23/57] tcg/ppc: " Richard Henderson
@ 2023-05-05 12:07   ` Peter Maydell
  0 siblings, 0 replies; 132+ messages in thread
From: Peter Maydell @ 2023-05-05 12:07 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On Wed, 3 May 2023 at 08:11, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> Instead of using helper_unaligned_{ld,st}, use the full load/store helpers.
> This will allow the fast path to increase alignment to implement atomicity
> while not immediately raising an alignment exception.
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>  tcg/ppc/tcg-target.c.inc | 44 ----------------------------------------
>  1 file changed, 44 deletions(-)
>

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>

thanks
-- PMM


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 24/57] tcg/loongarch64: Use full load/store helpers in user-only mode
  2023-05-03  7:06 ` [PATCH v4 24/57] tcg/loongarch64: " Richard Henderson
@ 2023-05-05 12:07   ` Peter Maydell
  0 siblings, 0 replies; 132+ messages in thread
From: Peter Maydell @ 2023-05-05 12:07 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On Wed, 3 May 2023 at 08:11, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> Instead of using helper_unaligned_{ld,st}, use the full load/store helpers.
> This will allow the fast path to increase alignment to implement atomicity
> while not immediately raising an alignment exception.
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>

thanks
-- PMM


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 25/57] tcg/riscv: Use full load/store helpers in user-only mode
  2023-05-03  7:06 ` [PATCH v4 25/57] tcg/riscv: " Richard Henderson
@ 2023-05-05 12:07   ` Peter Maydell
  0 siblings, 0 replies; 132+ messages in thread
From: Peter Maydell @ 2023-05-05 12:07 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On Wed, 3 May 2023 at 08:27, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> Instead of using helper_unaligned_{ld,st}, use the full load/store helpers.
> This will allow the fast path to increase alignment to implement atomicity
> while not immediately raising an alignment exception.
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>

thanks
-- PMM


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 26/57] tcg/arm: Adjust constraints on qemu_ld/st
  2023-05-03  7:06 ` [PATCH v4 26/57] tcg/arm: Adjust constraints on qemu_ld/st Richard Henderson
@ 2023-05-05 12:14   ` Peter Maydell
  2023-05-08 15:13     ` Richard Henderson
  0 siblings, 1 reply; 132+ messages in thread
From: Peter Maydell @ 2023-05-05 12:14 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On Wed, 3 May 2023 at 08:10, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> Always reserve r3 for tlb softmmu lookup.  Fix a bug in user-only
> ALL_QLDST_REGS, in that r14 is clobbered by the BLNE that leads
> to the misaligned trap.
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---

>  /*
> - * r0-r2 will be overwritten when reading the tlb entry (softmmu only)
> - * and r0-r1 doing the byte swapping, so don't use these.
> - * r3 is removed for softmmu to avoid clashes with helper arguments.
> + * r0-r3 will be overwritten when reading the tlb entry (softmmu only);
> + * r14 will be overwritten by the BLNE branching to the slow path.
>   */
>  #ifdef CONFIG_SOFTMMU
> -#define ALL_QLOAD_REGS \
> +#define ALL_QLDST_REGS \
>      (ALL_GENERAL_REGS & ~((1 << TCG_REG_R0) | (1 << TCG_REG_R1) | \
>                            (1 << TCG_REG_R2) | (1 << TCG_REG_R3) | \
>                            (1 << TCG_REG_R14)))
> -#define ALL_QSTORE_REGS \
> -    (ALL_GENERAL_REGS & ~((1 << TCG_REG_R0) | (1 << TCG_REG_R1) | \
> -                          (1 << TCG_REG_R2) | (1 << TCG_REG_R14) | \
> -                          ((TARGET_LONG_BITS == 64) << TCG_REG_R3)))
>  #else
> -#define ALL_QLOAD_REGS   ALL_GENERAL_REGS
> -#define ALL_QSTORE_REGS \
> -    (ALL_GENERAL_REGS & ~((1 << TCG_REG_R0) | (1 << TCG_REG_R1)))
> +#define ALL_QLDST_REGS   (ALL_GENERAL_REGS & ~(1 << TCG_REG_R14))
>  #endif

Why is it OK not to remove r0 and r1 from this any more ?
The commit message doesn't say anything about this bit of the change.

-- PMM


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 27/57] tcg/arm: Use full load/store helpers in user-only mode
  2023-05-03  7:06 ` [PATCH v4 27/57] tcg/arm: Use full load/store helpers in user-only mode Richard Henderson
@ 2023-05-05 12:15   ` Peter Maydell
  0 siblings, 0 replies; 132+ messages in thread
From: Peter Maydell @ 2023-05-05 12:15 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On Wed, 3 May 2023 at 08:09, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> Instead of using helper_unaligned_{ld,st}, use the full load/store helpers.
> This will allow the fast path to increase alignment to implement atomicity
> while not immediately raising an alignment exception.
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>

thanks
-- PMM


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 28/57] tcg/mips: Use full load/store helpers in user-only mode
  2023-05-03  7:06 ` [PATCH v4 28/57] tcg/mips: " Richard Henderson
@ 2023-05-05 12:15   ` Peter Maydell
  0 siblings, 0 replies; 132+ messages in thread
From: Peter Maydell @ 2023-05-05 12:15 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On Wed, 3 May 2023 at 08:17, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> Instead of using helper_unaligned_{ld,st}, use the full load/store helpers.
> This will allow the fast path to increase alignment to implement atomicity
> while not immediately raising an alignment exception.
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>

thanks
-- PMM


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 29/57] tcg/s390x: Use full load/store helpers in user-only mode
  2023-05-03  7:06 ` [PATCH v4 29/57] tcg/s390x: " Richard Henderson
@ 2023-05-05 12:16   ` Peter Maydell
  0 siblings, 0 replies; 132+ messages in thread
From: Peter Maydell @ 2023-05-05 12:16 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On Wed, 3 May 2023 at 08:29, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> Instead of using helper_unaligned_{ld,st}, use the full load/store helpers.
> This will allow the fast path to increase alignment to implement atomicity
> while not immediately raising an alignment exception.
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>

thanks
-- PMM


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 30/57] tcg/sparc64: Allocate %g2 as a third temporary
  2023-05-03  7:06 ` [PATCH v4 30/57] tcg/sparc64: Allocate %g2 as a third temporary Richard Henderson
@ 2023-05-05 12:19   ` Peter Maydell
  2023-05-08 15:17     ` Richard Henderson
  0 siblings, 1 reply; 132+ messages in thread
From: Peter Maydell @ 2023-05-05 12:19 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On Wed, 3 May 2023 at 08:17, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>  tcg/sparc64/tcg-target.c.inc | 15 +++++++--------
>  1 file changed, 7 insertions(+), 8 deletions(-)
>
> diff --git a/tcg/sparc64/tcg-target.c.inc b/tcg/sparc64/tcg-target.c.inc
> index e997db2645..64464ab363 100644
> --- a/tcg/sparc64/tcg-target.c.inc
> +++ b/tcg/sparc64/tcg-target.c.inc
> @@ -83,9 +83,10 @@ static const char * const tcg_target_reg_names[TCG_TARGET_NB_REGS] = {
>  #define ALL_GENERAL_REGS     MAKE_64BIT_MASK(0, 32)
>  #define ALL_QLDST_REGS       (ALL_GENERAL_REGS & ~SOFTMMU_RESERVE_REGS)
>
> -/* Define some temporary registers.  T2 is used for constant generation.  */
> +/* Define some temporary registers.  T3 is used for constant generation.  */
>  #define TCG_REG_T1  TCG_REG_G1
> -#define TCG_REG_T2  TCG_REG_O7
> +#define TCG_REG_T2  TCG_REG_G2
> +#define TCG_REG_T3  TCG_REG_O7
>
>  #ifndef CONFIG_SOFTMMU
>  # define TCG_GUEST_BASE_REG TCG_REG_I5
> @@ -110,7 +111,6 @@ static const int tcg_target_reg_alloc_order[] = {
>      TCG_REG_I4,
>      TCG_REG_I5,
>
> -    TCG_REG_G2,
>      TCG_REG_G3,
>      TCG_REG_G4,
>      TCG_REG_G5,
> @@ -492,8 +492,8 @@ static void tcg_out_movi_int(TCGContext *s, TCGType type, TCGReg ret,
>  static void tcg_out_movi(TCGContext *s, TCGType type,
>                           TCGReg ret, tcg_target_long arg)
>  {
> -    tcg_debug_assert(ret != TCG_REG_T2);
> -    tcg_out_movi_int(s, type, ret, arg, false, TCG_REG_T2);
> +    tcg_debug_assert(ret != TCG_REG_T3);
> +    tcg_out_movi_int(s, type, ret, arg, false, TCG_REG_T3);
>  }

Why do we need to change this usage of TCG_REG_T2 but not
any of the others ?

thanks
-- PMM


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 31/57] tcg/sparc64: Rename tcg_out_movi_imm13 to tcg_out_movi_s13
  2023-05-03  7:06 ` [PATCH v4 31/57] tcg/sparc64: Rename tcg_out_movi_imm13 to tcg_out_movi_s13 Richard Henderson
@ 2023-05-05 12:20   ` Peter Maydell
  0 siblings, 0 replies; 132+ messages in thread
From: Peter Maydell @ 2023-05-05 12:20 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On Wed, 3 May 2023 at 08:12, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> Emphasize that the constant is signed.
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>  tcg/sparc64/tcg-target.c.inc | 30 +++++++++++++++---------------
>  1 file changed, 15 insertions(+), 15 deletions(-)

Commit message says we're just doing a rename, but...

> @@ -425,15 +425,15 @@ static void tcg_out_movi_int(TCGContext *s, TCGType type, TCGReg ret,
>      tcg_target_long hi, lo = (int32_t)arg;
>      tcg_target_long test, lsb;
>
> -    /* A 32-bit constant, or 32-bit zero-extended to 64-bits.  */
> -    if (type == TCG_TYPE_I32 || arg == (uint32_t)arg) {
> -        tcg_out_movi_imm32(s, ret, arg);
> +    /* A 13-bit constant sign-extended to 64-bits.  */
> +    if (check_fit_tl(arg, 13)) {
> +        tcg_out_movi_s13(s, ret, arg);
>          return;
>      }
>
> -    /* A 13-bit constant sign-extended to 64-bits.  */
> -    if (check_fit_tl(arg, 13)) {
> -        tcg_out_movi_imm13(s, ret, arg);
> +    /* A 32-bit constant, or 32-bit zero-extended to 64-bits.  */
> +    if (type == TCG_TYPE_I32 || arg == (uint32_t)arg) {
> +        tcg_out_movi_imm32(s, ret, arg);
>          return;
>      }

...the commit has other code changes. Should these be in some
other patch ?

thanks
-- PMM


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 32/57] tcg/sparc64: Rename tcg_out_movi_imm32 to tcg_out_movi_u32
  2023-05-03  7:06 ` [PATCH v4 32/57] tcg/sparc64: Rename tcg_out_movi_imm32 to tcg_out_movi_u32 Richard Henderson
@ 2023-05-05 12:22   ` Peter Maydell
  0 siblings, 0 replies; 132+ messages in thread
From: Peter Maydell @ 2023-05-05 12:22 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On Wed, 3 May 2023 at 08:15, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> Emphasize that the constant is unsigned.
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>  tcg/sparc64/tcg-target.c.inc | 24 ++++++++++--------------
>  1 file changed, 10 insertions(+), 14 deletions(-)
>
> diff --git a/tcg/sparc64/tcg-target.c.inc b/tcg/sparc64/tcg-target.c.inc
> index 2e6127d506..e244209890 100644
> --- a/tcg/sparc64/tcg-target.c.inc
> +++ b/tcg/sparc64/tcg-target.c.inc
> @@ -399,22 +399,18 @@ static void tcg_out_sethi(TCGContext *s, TCGReg ret, uint32_t arg)
>      tcg_out32(s, SETHI | INSN_RD(ret) | ((arg & 0xfffffc00) >> 10));
>  }
>
> +/* A 13-bit constant sign-extended to 64 bits.  */

This should have been in the previous patch, I think ?

>  static void tcg_out_movi_s13(TCGContext *s, TCGReg ret, int32_t arg)
>  {
>      tcg_out_arithi(s, ret, TCG_REG_G0, arg, ARITH_OR);
>  }
>
> -static void tcg_out_movi_imm32(TCGContext *s, TCGReg ret, int32_t arg)
> +/* A 32-bit constant zero-extended to 64 bits.  */
> +static void tcg_out_movi_u32(TCGContext *s, TCGReg ret, uint32_t arg)
>  {
> -    if (check_fit_i32(arg, 13)) {
> -        /* A 13-bit constant sign-extended to 64-bits.  */
> -        tcg_out_movi_s13(s, ret, arg);
> -    } else {
> -        /* A 32-bit constant zero-extended to 64 bits.  */
> -        tcg_out_sethi(s, ret, arg);
> -        if (arg & 0x3ff) {
> -            tcg_out_arithi(s, ret, ret, arg & 0x3ff, ARITH_OR);
> -        }
> +    tcg_out_sethi(s, ret, arg);
> +    if (arg & 0x3ff) {
> +        tcg_out_arithi(s, ret, ret, arg & 0x3ff, ARITH_OR);
>      }
>  }

More code changes in a patch that says it's a rename-only.

-- PMM


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 33/57] tcg/sparc64: Split out tcg_out_movi_s32
  2023-05-03  7:06 ` [PATCH v4 33/57] tcg/sparc64: Split out tcg_out_movi_s32 Richard Henderson
@ 2023-05-05 12:23   ` Peter Maydell
  0 siblings, 0 replies; 132+ messages in thread
From: Peter Maydell @ 2023-05-05 12:23 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On Wed, 3 May 2023 at 08:09, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>  tcg/sparc64/tcg-target.c.inc | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>

thanks
-- PMM


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 34/57] tcg/sparc64: Use standard slow path for softmmu
  2023-05-03  7:06 ` [PATCH v4 34/57] tcg/sparc64: Use standard slow path for softmmu Richard Henderson
@ 2023-05-05 12:26   ` Peter Maydell
  0 siblings, 0 replies; 132+ messages in thread
From: Peter Maydell @ 2023-05-05 12:26 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On Wed, 3 May 2023 at 08:11, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> Drop the target-specific trampolines for the standard slow path.
> This lets us use tcg_out_helper_{ld,st}_args, and handles the new
> atomicity bits within MemOp.
>
> At the same time, use the full load/store helpers for user-only mode.
> Drop inline unaligned access support for user-only mode, as it does
> not handle atomicity.
>
> Use TCG_REG_T[1-3] in the tlb lookup, instead of TCG_REG_O[0-2].
> This allows the constraints to be simplified.
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>  tcg/sparc64/tcg-target-con-set.h |   2 -
>  tcg/sparc64/tcg-target-con-str.h |   1 -
>  tcg/sparc64/tcg-target.h         |   1 +
>  tcg/sparc64/tcg-target.c.inc     | 610 +++++++++----------------------
>  4 files changed, 182 insertions(+), 432 deletions(-)

This is a pretty big patch, and git diff hasn't done it any
favours either. Perhaps I'll have more stamina for it in v5,
or maybe somebody else will look at it...

-- PMM


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 35/57] accel/tcg: Remove helper_unaligned_{ld,st}
  2023-05-03  7:06 ` [PATCH v4 35/57] accel/tcg: Remove helper_unaligned_{ld,st} Richard Henderson
@ 2023-05-05 12:27   ` Peter Maydell
  0 siblings, 0 replies; 132+ messages in thread
From: Peter Maydell @ 2023-05-05 12:27 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On Wed, 3 May 2023 at 08:19, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> These functions are now unused.
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>

thanks
-- PMM


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 36/57] tcg/loongarch64: Assert the host supports unaligned accesses
  2023-05-03  7:06 ` [PATCH v4 36/57] tcg/loongarch64: Assert the host supports unaligned accesses Richard Henderson
@ 2023-05-05 12:30   ` Peter Maydell
  2023-05-05 13:24   ` WANG Xuerui
  1 sibling, 0 replies; 132+ messages in thread
From: Peter Maydell @ 2023-05-05 12:30 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On Wed, 3 May 2023 at 08:15, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> This should be true of all server class loongarch64.

By this do you mean "anything that runs Linux" ?

If not, we should be a bit more user-friendly about bailing
out than just assert()ing. For the "tried to run on an
ARMv5" case we use error_report and exit:

            error_report("TCG: ARMv%d is unsupported; exiting", arm_arch);
            exit(EXIT_FAILURE);
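Something along those lines for the loongarch64 check, e.g. (sketch
only, reusing the hwcap test from the patch):

    if (!(hwcap & HWCAP_LOONGARCH_UAL)) {
        error_report("TCG: unaligned accesses (UAL) not supported; exiting");
        exit(EXIT_FAILURE);
    }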

thanks
-- PMM


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 37/57] tcg/loongarch64: Support softmmu unaligned accesses
  2023-05-03  7:06 ` [PATCH v4 37/57] tcg/loongarch64: Support softmmu " Richard Henderson
@ 2023-05-05 12:35   ` Peter Maydell
  0 siblings, 0 replies; 132+ messages in thread
From: Peter Maydell @ 2023-05-05 12:35 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On Wed, 3 May 2023 at 08:25, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> Test the final byte of an unaligned access.
> Use BSTRINS.D to clear the range of bits, rather than AND.
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>

(at least, it's the same general shape as other architectures;
I have no loongarch-specific knowledge)

thanks
-- PMM


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 39/57] tcg: Introduce tcg_target_has_memory_bswap
  2023-05-03  7:06 ` [PATCH v4 39/57] tcg: Introduce tcg_target_has_memory_bswap Richard Henderson
@ 2023-05-05 12:41   ` Peter Maydell
  0 siblings, 0 replies; 132+ messages in thread
From: Peter Maydell @ 2023-05-05 12:41 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On Wed, 3 May 2023 at 08:20, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> Replace the unparameterized TCG_TARGET_HAS_MEMORY_BSWAP macro
> with a function with a memop argument.
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
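(For reference, the shape of the new hook as it appears in the backends
later in this series; the body below is only an illustrative example,
not any particular backend's code:)

    bool tcg_target_has_memory_bswap(MemOp memop)
    {
        /* Example only: accept byte-swapped accesses up to 8 bytes. */
        return (memop & MO_SIZE) <= MO_64;
    }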

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>

thanks
-- PMM


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 40/57] tcg: Add INDEX_op_qemu_{ld,st}_i128
  2023-05-03  7:06 ` [PATCH v4 40/57] tcg: Add INDEX_op_qemu_{ld,st}_i128 Richard Henderson
@ 2023-05-05 12:45   ` Peter Maydell
  0 siblings, 0 replies; 132+ messages in thread
From: Peter Maydell @ 2023-05-05 12:45 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On Wed, 3 May 2023 at 08:13, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> Add opcodes for backend support for 128-bit memory operations.
>
> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>

thanks
-- PMM


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 41/57] tcg: Support TCG_TYPE_I128 in tcg_out_{ld, st}_helper_{args, ret}
  2023-05-03  7:06 ` [PATCH v4 41/57] tcg: Support TCG_TYPE_I128 in tcg_out_{ld, st}_helper_{args, ret} Richard Henderson
@ 2023-05-05 12:53   ` Peter Maydell
  0 siblings, 0 replies; 132+ messages in thread
From: Peter Maydell @ 2023-05-05 12:53 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On Wed, 3 May 2023 at 08:12, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>

thanks
-- PMM


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 42/57] tcg: Introduce atom_and_align_for_opc
  2023-05-03  7:06 ` [PATCH v4 42/57] tcg: Introduce atom_and_align_for_opc Richard Henderson
@ 2023-05-05 13:03   ` Peter Maydell
  0 siblings, 0 replies; 132+ messages in thread
From: Peter Maydell @ 2023-05-05 13:03 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On Wed, 3 May 2023 at 08:27, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> Examine MemOp for atomicity and alignment, adjusting alignment
> as required to implement atomicity on the host.
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
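(For reference, the calling shape the backend patches later in this
series use -- the lines below just mirror those quoted hunks:)

    MemOp a_bits, atom_a, atom_u;
    unsigned a_mask;

    a_bits = atom_and_align_for_opc(s, &atom_a, &atom_u, opc,
                                    MO_ATOM_IFALIGN, false);
    a_mask = (1u << a_bits) - 1;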

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>

thanks
-- PMM


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 43/57] tcg/i386: Use atom_and_align_for_opc
  2023-05-03  7:06 ` [PATCH v4 43/57] tcg/i386: Use atom_and_align_for_opc Richard Henderson
@ 2023-05-05 13:14   ` Peter Maydell
  0 siblings, 0 replies; 132+ messages in thread
From: Peter Maydell @ 2023-05-05 13:14 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On Wed, 3 May 2023 at 08:12, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> No change to the ultimate load/store routines yet, so some atomicity
> conditions not yet honored, but plumbs the change to alignment through
> the relevant functions.
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>

thanks
-- PMM


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 44/57] tcg/aarch64: Use atom_and_align_for_opc
  2023-05-03  7:06 ` [PATCH v4 44/57] tcg/aarch64: " Richard Henderson
@ 2023-05-05 13:15   ` Peter Maydell
  0 siblings, 0 replies; 132+ messages in thread
From: Peter Maydell @ 2023-05-05 13:15 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On Wed, 3 May 2023 at 08:14, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>  tcg/aarch64/tcg-target.c.inc | 38 +++++++++++++++++++-----------------
>  1 file changed, 20 insertions(+), 18 deletions(-)
>


Reviewed-by: Peter Maydell <peter.maydell@linaro.org>

thanks
-- PMM


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 45/57] tcg/arm: Use atom_and_align_for_opc
  2023-05-03  7:06 ` [PATCH v4 45/57] tcg/arm: " Richard Henderson
@ 2023-05-05 13:15   ` Peter Maydell
  0 siblings, 0 replies; 132+ messages in thread
From: Peter Maydell @ 2023-05-05 13:15 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On Wed, 3 May 2023 at 08:12, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> No change to the ultimate load/store routines yet, so some atomicity
> conditions not yet honored, but plumbs the change to alignment through
> the relevant functions.
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>

thanks
-- PMM


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 46/57] tcg/loongarch64: Use atom_and_align_for_opc
  2023-05-03  7:06 ` [PATCH v4 46/57] tcg/loongarch64: " Richard Henderson
@ 2023-05-05 13:16   ` Peter Maydell
  0 siblings, 0 replies; 132+ messages in thread
From: Peter Maydell @ 2023-05-05 13:16 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On Wed, 3 May 2023 at 08:12, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>

thanks
-- PMM


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 47/57] tcg/mips: Use atom_and_align_for_opc
  2023-05-03  7:06 ` [PATCH v4 47/57] tcg/mips: " Richard Henderson
@ 2023-05-05 13:17   ` Peter Maydell
  0 siblings, 0 replies; 132+ messages in thread
From: Peter Maydell @ 2023-05-05 13:17 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On Wed, 3 May 2023 at 08:41, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>  tcg/mips/tcg-target.c.inc | 11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/tcg/mips/tcg-target.c.inc b/tcg/mips/tcg-target.c.inc
> index cd0254a0d7..43a8ffac17 100644
> --- a/tcg/mips/tcg-target.c.inc
> +++ b/tcg/mips/tcg-target.c.inc
> @@ -1139,6 +1139,7 @@ static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *l)
>  typedef struct {
>      TCGReg base;
>      MemOp align;
> +    MemOp atom;
>  } HostAddress;
>
>  bool tcg_target_has_memory_bswap(MemOp memop)
> @@ -1158,11 +1159,16 @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
>  {
>      TCGLabelQemuLdst *ldst = NULL;
>      MemOp opc = get_memop(oi);
> -    unsigned a_bits = get_alignment_bits(opc);
> +    MemOp a_bits, atom_u;
>      unsigned s_bits = opc & MO_SIZE;
> -    unsigned a_mask = (1 << a_bits) - 1;
> +    unsigned a_mask;
>      TCGReg base;
>
> +    a_bits = atom_and_align_for_opc(s, &h->atom, &atom_u, opc,
> +                                    MO_ATOM_IFALIGN, false);
> +    h->align = a_bits;
> +    a_mask = (1 << a_bits) - 1;
> +
>  #ifdef CONFIG_SOFTMMU
>      unsigned s_mask = (1 << s_bits) - 1;
>      int mem_index = get_mmuidx(oi);

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>

(For a set of functions so new they're not even in master yet,
these do seem to have a surprisingly large variance in how
they're setting up these HostAddress structs...)

thanks
-- PMM


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 48/57] tcg/ppc: Use atom_and_align_for_opc
  2023-05-03  7:06 ` [PATCH v4 48/57] tcg/ppc: " Richard Henderson
@ 2023-05-05 13:18   ` Peter Maydell
  2023-05-08 17:32     ` Richard Henderson
  0 siblings, 1 reply; 132+ messages in thread
From: Peter Maydell @ 2023-05-05 13:18 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On Wed, 3 May 2023 at 08:13, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>  tcg/ppc/tcg-target.c.inc | 17 ++++++++++++++++-
>  1 file changed, 16 insertions(+), 1 deletion(-)
>
> diff --git a/tcg/ppc/tcg-target.c.inc b/tcg/ppc/tcg-target.c.inc
> index f0a4118bbb..60375804cd 100644
> --- a/tcg/ppc/tcg-target.c.inc
> +++ b/tcg/ppc/tcg-target.c.inc
> @@ -2034,7 +2034,22 @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
>  {
>      TCGLabelQemuLdst *ldst = NULL;
>      MemOp opc = get_memop(oi);
> -    unsigned a_bits = get_alignment_bits(opc);
> +    MemOp a_bits, atom_a, atom_u;
> +
> +    /*
> +     * Book II, Section 1.4, Single-Copy Atomicity, specifies:
> +     *
> +     * Before 3.0, "An access that is not atomic is performed as a set of
> +     * smaller disjoint atomic accesses. In general, the number and alignment
> +     * of these accesses are implementation-dependent."  Thus MO_ATOM_IFALIGN.
> +     *
> +     * As of 3.0, "the non-atomic access is performed as described in
> +     * the corresponding list", which matches MO_ATOM_SUBALIGN.
> +     */
> +    a_bits = atom_and_align_for_opc(s, &atom_a, &atom_u, opc,
> +                                    have_isa_3_00 ? MO_ATOM_SUBALIGN
> +                                                  : MO_ATOM_IFALIGN,
> +                                    false);
>

Why doesn't this patch have changes to a HostAddress struct
like all the other archs ?

-- PMM


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 49/57] tcg/riscv: Use atom_and_align_for_opc
  2023-05-03  7:06 ` [PATCH v4 49/57] tcg/riscv: " Richard Henderson
@ 2023-05-05 13:19   ` Peter Maydell
  2023-05-08 17:33     ` Richard Henderson
  0 siblings, 1 reply; 132+ messages in thread
From: Peter Maydell @ 2023-05-05 13:19 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On Wed, 3 May 2023 at 08:13, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>  tcg/riscv/tcg-target.c.inc | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/tcg/riscv/tcg-target.c.inc b/tcg/riscv/tcg-target.c.inc
> index 37870c89fc..4dd33c73e8 100644
> --- a/tcg/riscv/tcg-target.c.inc
> +++ b/tcg/riscv/tcg-target.c.inc
> @@ -910,8 +910,12 @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, TCGReg *pbase,
>  {
>      TCGLabelQemuLdst *ldst = NULL;
>      MemOp opc = get_memop(oi);
> -    unsigned a_bits = get_alignment_bits(opc);
> -    unsigned a_mask = (1u << a_bits) - 1;
> +    MemOp a_bits, atom_a, atom_u;
> +    unsigned a_mask;
> +
> +    a_bits = atom_and_align_for_opc(s, &atom_a, &atom_u, opc,
> +                                    MO_ATOM_IFALIGN, false);
> +    a_mask = (1u << a_bits) - 1;
>
>  #ifdef CONFIG_SOFTMMU
>      unsigned s_bits = opc & MO_SIZE;

Same remark as for ppc.

-- PMM


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 50/57] tcg/s390x: Use atom_and_align_for_opc
  2023-05-03  7:06 ` [PATCH v4 50/57] tcg/s390x: " Richard Henderson
@ 2023-05-05 13:20   ` Peter Maydell
  0 siblings, 0 replies; 132+ messages in thread
From: Peter Maydell @ 2023-05-05 13:20 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On Wed, 3 May 2023 at 08:20, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>  tcg/s390x/tcg-target.c.inc | 14 ++++++++++----
>  1 file changed, 10 insertions(+), 4 deletions(-)
>

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>

thanks
-- PMM


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 51/57] tcg/sparc64: Use atom_and_align_for_opc
  2023-05-03  7:06 ` [PATCH v4 51/57] tcg/sparc64: " Richard Henderson
@ 2023-05-05 13:20   ` Peter Maydell
  2023-05-08 17:34     ` Richard Henderson
  0 siblings, 1 reply; 132+ messages in thread
From: Peter Maydell @ 2023-05-05 13:20 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On Wed, 3 May 2023 at 08:13, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>  tcg/sparc64/tcg-target.c.inc | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/tcg/sparc64/tcg-target.c.inc b/tcg/sparc64/tcg-target.c.inc
> index bb23038529..4f9ec02b1f 100644
> --- a/tcg/sparc64/tcg-target.c.inc
> +++ b/tcg/sparc64/tcg-target.c.inc
> @@ -1028,11 +1028,13 @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
>  {
>      TCGLabelQemuLdst *ldst = NULL;
>      MemOp opc = get_memop(oi);
> -    unsigned a_bits = get_alignment_bits(opc);
> -    unsigned s_bits = opc & MO_SIZE;
> +    MemOp s_bits = opc & MO_SIZE;
> +    MemOp a_bits, atom_a, atom_u;
>      unsigned a_mask;
>
>      /* We don't support unaligned accesses. */
> +    a_bits = atom_and_align_for_opc(s, &atom_a, &atom_u, opc,
> +                                    MO_ATOM_IFALIGN, false);
>      a_bits = MAX(a_bits, s_bits);
>      a_mask = (1u << a_bits) - 1;
>
> --

No changes to HostAddress struct again?

-- PMM


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 36/57] tcg/loongarch64: Assert the host supports unaligned accesses
  2023-05-03  7:06 ` [PATCH v4 36/57] tcg/loongarch64: Assert the host supports unaligned accesses Richard Henderson
  2023-05-05 12:30   ` Peter Maydell
@ 2023-05-05 13:24   ` WANG Xuerui
  2023-05-06  2:03     ` Song Gao
  1 sibling, 1 reply; 132+ messages in thread
From: WANG Xuerui @ 2023-05-05 13:24 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel
  Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

Hi,

On 2023/5/3 15:06, Richard Henderson wrote:
> This should be true of all server class loongarch64.
And desktop-class (i.e. all Loongson-3 series).
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>   tcg/loongarch64/tcg-target.c.inc | 6 ++++++
>   1 file changed, 6 insertions(+)
>
> diff --git a/tcg/loongarch64/tcg-target.c.inc b/tcg/loongarch64/tcg-target.c.inc
> index e651ec5c71..ccc13ffdb4 100644
> --- a/tcg/loongarch64/tcg-target.c.inc
> +++ b/tcg/loongarch64/tcg-target.c.inc
> @@ -30,6 +30,7 @@
>    */
>   
>   #include "../tcg-ldst.c.inc"
> +#include <asm/hwcap.h>
>   
>   #ifdef CONFIG_DEBUG_TCG
>   static const char * const tcg_target_reg_names[TCG_TARGET_NB_REGS] = {
> @@ -1674,6 +1675,11 @@ static void tcg_target_qemu_prologue(TCGContext *s)
>   
>   static void tcg_target_init(TCGContext *s)
>   {
> +    unsigned long hwcap = qemu_getauxval(AT_HWCAP);
> +
> +    /* All server class loongarch have UAL; only embedded do not. */
> +    assert(hwcap & HWCAP_LOONGARCH_UAL);
> +
It is a bit worrying that a future SoC (the octa-core Loongson 2K3000) 
might get used for light desktop use cases (e.g. laptops) where QEMU is 
arguably relevant, but it's currently unclear whether its LA364 
micro-architecture will have UAL. The Loongson folks may have more to share.
>       tcg_target_available_regs[TCG_TYPE_I32] = ALL_GENERAL_REGS;
>       tcg_target_available_regs[TCG_TYPE_I64] = ALL_GENERAL_REGS;
>   


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 52/57] tcg/i386: Honor 64-bit atomicity in 32-bit mode
  2023-05-03  7:06 ` [PATCH v4 52/57] tcg/i386: Honor 64-bit atomicity in 32-bit mode Richard Henderson
@ 2023-05-05 13:27   ` Peter Maydell
  2023-05-08 16:15     ` Richard Henderson
  0 siblings, 1 reply; 132+ messages in thread
From: Peter Maydell @ 2023-05-05 13:27 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On Wed, 3 May 2023 at 08:18, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> Use the fpu to perform 64-bit loads and stores.
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>


> @@ -2091,7 +2095,20 @@ static void tcg_out_qemu_ld_direct(TCGContext *s, TCGReg datalo, TCGReg datahi,
>              datalo = datahi;
>              datahi = t;
>          }
> -        if (h.base == datalo || h.index == datalo) {
> +        if (h.atom == MO_64) {
> +            /*
> +             * Atomicity requires that we use use a single 8-byte load.
> +             * For simplicity and code size, always use the FPU for this.
> +             * Similar insns using SSE/AVX are merely larger.

I'm surprised there's no performance penalty for throwing old-school
FPU insns into what is presumably otherwise code that's only
using modern SSE.

> +             * Load from memory in one go, then store back to the stack,
> +             * from whence we can load into the correct integer regs.
> +             */
> +            tcg_out_modrm_sib_offset(s, OPC_ESCDF + h.seg, ESCDF_FILD_m64,
> +                                     h.base, h.index, 0, h.ofs);
> +            tcg_out_modrm_offset(s, OPC_ESCDF, ESCDF_FISTP_m64, TCG_REG_ESP, 0);
> +            tcg_out_modrm_offset(s, movop, datalo, TCG_REG_ESP, 0);
> +            tcg_out_modrm_offset(s, movop, datahi, TCG_REG_ESP, 4);
> +        } else if (h.base == datalo || h.index == datalo) {
>              tcg_out_modrm_sib_offset(s, OPC_LEA, datahi,
>                                       h.base, h.index, 0, h.ofs);
>              tcg_out_modrm_offset(s, movop + h.seg, datalo, datahi, 0);

I assume the caller has arranged that the top of the stack
is trashable at this point?

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>

-- PMM


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 53/57] tcg/i386: Support 128-bit load/store with have_atomic16
  2023-05-03  7:06 ` [PATCH v4 53/57] tcg/i386: Support 128-bit load/store with have_atomic16 Richard Henderson
@ 2023-05-05 13:34   ` Peter Maydell
  0 siblings, 0 replies; 132+ messages in thread
From: Peter Maydell @ 2023-05-05 13:34 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On Wed, 3 May 2023 at 08:13, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>  tcg/i386/tcg-target.h     |   3 +-
>  tcg/i386/tcg-target.c.inc | 184 +++++++++++++++++++++++++++++++++++++-
>  2 files changed, 182 insertions(+), 5 deletions(-)
>

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>

thanks
-- PMM


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 54/57] tcg/aarch64: Rename temporaries
  2023-05-03  7:06 ` [PATCH v4 54/57] tcg/aarch64: Rename temporaries Richard Henderson
@ 2023-05-05 13:36   ` Peter Maydell
  0 siblings, 0 replies; 132+ messages in thread
From: Peter Maydell @ 2023-05-05 13:36 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On Wed, 3 May 2023 at 08:12, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> We will need to allocate a second general-purpose temporary.
> Rename the existing temps to add a distinguishing number.
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>

thanks
-- PMM


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 55/57] tcg/aarch64: Support 128-bit load/store
  2023-05-03  7:06 ` [PATCH v4 55/57] tcg/aarch64: Support 128-bit load/store Richard Henderson
@ 2023-05-05 13:41   ` Peter Maydell
  0 siblings, 0 replies; 132+ messages in thread
From: Peter Maydell @ 2023-05-05 13:41 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On Wed, 3 May 2023 at 08:21, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> Use LDXP+STXP when LSE2 is not present and 16-byte atomicity is required,
> and LDP/STP otherwise.  This requires allocating a second general-purpose
> temporary, as Rs cannot overlap Rn in STXP.
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>  tcg/aarch64/tcg-target-con-set.h |   2 +
>  tcg/aarch64/tcg-target.h         |   2 +-
>  tcg/aarch64/tcg-target.c.inc     | 181 ++++++++++++++++++++++++++++++-
>  3 files changed, 181 insertions(+), 4 deletions(-)
>

Reviewed-by: Peter Maydell <peter.maydell@linaro.org>

thanks
-- PMM


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 00/57] tcg: Improve atomicity support
  2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
                   ` (56 preceding siblings ...)
  2023-05-03  7:06 ` [PATCH v4 57/57] tcg/s390x: " Richard Henderson
@ 2023-05-05 13:43 ` Peter Maydell
  57 siblings, 0 replies; 132+ messages in thread
From: Peter Maydell @ 2023-05-05 13:43 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On Wed, 3 May 2023 at 08:10, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> v1: https://lore.kernel.org/qemu-devel/20221118094754.242910-1-richard.henderson@linaro.org/
> v2: https://lore.kernel.org/qemu-devel/20230216025739.1211680-1-richard.henderson@linaro.org/
> v3: https://lore.kernel.org/qemu-devel/20230425193146.2106111-1-richard.henderson@linaro.org/
>
> Based-on: 20230503065729.1745843-1-richard.henderson@linaro.org
> ("[PATCH v4 00/54] tcg: Simplify calls to load/store helpers")
>
> The main objective here is to support Arm FEAT_LSE2, which says that any
> single memory access that does not cross a 16-byte boundary is atomic.
> This is the MO_ATOM_WITHIN16 control.
>
> While I'm touching all of this, a secondary objective is to handle the
> atomicity of the IBM machines.  Both Power and s390x treat misaligned
> accesses as atomic on the lsb of the pointer.  For instance, an 8-byte
> access at ptr % 8 == 4 will appear as two atomic 4-byte accesses, and
> ptr % 4 == 2 will appear as four 2-byte accesses.
> This is the MO_ATOM_SUBALIGN control.
>
> By default, acceses are atomic only if aligned, which is the current
> behaviour of the tcg code generator (mostly, anyway, there were bugs).
> This is the MO_ATOM_IFALIGN control.
>
> Further, one can say that a large memory access is really a set of
> contiguous smaller accesses, and we need not provide more atomicity
> than that (modulo MO_ATOM_WITHIN16).  This is the MO_ATMAX_* control.

I've reviewed as much of this series as I'm going to -- I left
a few patches for people who know more about ppc/s390 than me.

thanks
-- PMM


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 03/57] accel/tcg: Introduce tlb_read_idx
  2023-05-04 15:02   ` Peter Maydell
@ 2023-05-05 18:57     ` Richard Henderson
  2023-05-07 10:09       ` Peter Maydell
  0 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-05 18:57 UTC (permalink / raw)
  To: Peter Maydell
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv,
	qemu-s390x, Alex Bennée

On 5/4/23 16:02, Peter Maydell wrote:
> On Wed, 3 May 2023 at 08:15, Richard Henderson
> <richard.henderson@linaro.org> wrote:
>>
>> Instead of playing with offsetof in various places, use
>> MMUAccessType to index an array.  This is easily defined
>> instead of the previous dummy padding array in the union.
>>
>> Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
>> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
>> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
>> ---
> 
>> @@ -1802,7 +1763,8 @@ static void *atomic_mmu_lookup(CPUArchState *env, target_ulong addr,
>>       if (prot & PAGE_WRITE) {
>>           tlb_addr = tlb_addr_write(tlbe);
>>           if (!tlb_hit(tlb_addr, addr)) {
>> -            if (!VICTIM_TLB_HIT(addr_write, addr)) {
>> +            if (!victim_tlb_hit(env, mmu_idx, index, MMU_DATA_STORE,
>> +                                addr & TARGET_PAGE_MASK)) {
>>                   tlb_fill(env_cpu(env), addr, size,
>>                            MMU_DATA_STORE, mmu_idx, retaddr);
>>                   index = tlb_index(env, mmu_idx, addr);
>> @@ -1835,7 +1797,8 @@ static void *atomic_mmu_lookup(CPUArchState *env, target_ulong addr,
>>       } else /* if (prot & PAGE_READ) */ {
>>           tlb_addr = tlbe->addr_read;

read

>>           if (!tlb_hit(tlb_addr, addr)) {
>> -            if (!VICTIM_TLB_HIT(addr_write, addr)) {

write

>> +            if (!victim_tlb_hit(env, mmu_idx, index, MMU_DATA_LOAD,
>> +                                addr & TARGET_PAGE_MASK)) {
> 
> This was previously looking at addr_write, but now we pass
> MMU_DATA_LOAD ?

This is the read portion of the read+write check.  The write portion is above in the 
previous hunk.  So this is an existing error, fixed here, and I hadn't noticed.


r~


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 06/57] accel/tcg: Honor atomicity of loads
  2023-05-04 17:17   ` Peter Maydell
@ 2023-05-05 20:19     ` Richard Henderson
  2023-05-09 12:04       ` Peter Maydell
  0 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-05 20:19 UTC (permalink / raw)
  To: Peter Maydell
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv,
	qemu-s390x, Alex Bennée

On 5/4/23 18:17, Peter Maydell wrote:
>> +    case MO_ATOM_SUBALIGN:
>> +        tmp = p & -p;
>> +        if (tmp != 0 && tmp < atmax) {
>> +            atmax = tmp;
>> +        }
>> +        break;
> 
> I don't understand the bit manipulation going on here.
> AIUI what we're trying to do is say "if e.g. p is only
> 2-aligned then we only get 2-alignment". But, suppose
> p == 0x1002. Then (p & -p) is 0x2. But that's MO_32,
> not MO_16. Am I missing something ?

You're right, this is missing a ctz32().
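Roughly like this, I think (sketch only, not the final patch):

    case MO_ATOM_SUBALIGN:
        tmp = p & -p;          /* lowest set bit = natural alignment of p */
        if (tmp != 0) {
            tmp = ctz32(tmp);  /* converted to a log2 MemOp size,
                                  e.g. 0x1002 -> 0x2 -> MO_16 */
            if (tmp < atmax) {
                atmax = tmp;
            }
        }
        break;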

> 
> (Also, it would be nice to have a comment mentioning
> what (p & -p) does, so readers don't have to try to
> search for a not very-searchable expression to find out.)
> 
>> +    case MO_ATOM_WITHIN16:
>> +        tmp = p & 15;
>> +        if (tmp + (1 << size) <= 16) {
>> +            atmax = size;
> 
> OK, so this is "whole operation is within 16 bytes,
> whole operation must be atomic"...
> 
>> +        } else if (atmax == size) {
>> +            return MO_8;
> 
> ...but I don't understand the interaction of WITHIN16
> and also specifying an ATMAX value that's not ATMAX_SIZE.

I'm trying to describe e.g. LDP, which if not within16 has two 8-byte elements, one or 
both of which must be atomic.  We will have set MO_ATOM_WITHIN16 | MO_ATMAX_8.

If atmax == size, there is only one element, and since it is not within16, there is no 
atomicity.

>> +        } else if (tmp + (1 << atmax) != 16) {
> 
> Why is this doing an exact inequality check?
> What if you're asking for a load of 8 bytes at
> MO_ATMAX_2 from a pointer that's at an offset of
> 10 bytes from a 16-byte boundary? Then tmp is 10,
> tmp + (1 << atmax) is 12, but we could still do the
> loads at atomicity 2. This doesn't seem to me to be
> any different from the case it does catch where
> the first ATMAX_2-sized unit happens to be the only
> thing in this 16-byte block.

If the LDP is aligned mod 8, but not aligned mod 16, then both 8-byte operations must be 
(separately) atomic, and we return MO_64.

>> +            /*
>> +             * Paired load/store, where the pairs aren't aligned.
>> +             * One of the two must still be handled atomically.
>> +             */
>> +            atmax = -atmax;

... whereas returning -MO_64 tells the caller that we must handle an unaligned atomic 
operation.

>> +    /*
>> +     * If the page is not writable, then assume the value is immutable
>> +     * and requires no locking.  This ignores the case of MAP_SHARED with
>> +     * another process, because the fallback start_exclusive solution
>> +     * provides no protection across processes.
>> +     */
>> +    if (!page_check_range(h2g(pv), 8, PAGE_WRITE)) {
>> +        uint64_t *p = __builtin_assume_aligned(pv, 8);
>> +        return *p;
>> +    }
> 
> This will also do a non-atomic read for the case where
> the guest has mapped the same memory twice at different
> addresses, once read-only and once writeable, I think.
> In theory in that situation we could use start_exclusive.
> But maybe that's a weird corner case we can ignore?

We don't handle multiple mappings at all well.  There is an outstanding bug report about 
read+write vs read+execute mappings for a jit -- our write-protect scheme for flushing TBs 
does not work for that case.

Since we can't detect the multiple mappings at this point, I'm tempted to ignore it.
But you're correct that we could drop this check and let start_exclusive handle it.



r~


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 36/57] tcg/loongarch64: Assert the host supports unaligned accesses
  2023-05-05 13:24   ` WANG Xuerui
@ 2023-05-06  2:03     ` Song Gao
  0 siblings, 0 replies; 132+ messages in thread
From: Song Gao @ 2023-05-06  2:03 UTC (permalink / raw)
  To: WANG Xuerui, Richard Henderson, qemu-devel
  Cc: git, philmd, qemu-arm, qemu-riscv, qemu-s390x

Hi,

On 2023/5/5 21:24, WANG Xuerui wrote:
> Hi,
>
> On 2023/5/3 15:06, Richard Henderson wrote:
>> This should be true of all server class loongarch64.
> And desktop-class (i.e. all Loongson-3 series).
>>
>> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
>> ---
>>   tcg/loongarch64/tcg-target.c.inc | 6 ++++++
>>   1 file changed, 6 insertions(+)
>>
>> diff --git a/tcg/loongarch64/tcg-target.c.inc 
>> b/tcg/loongarch64/tcg-target.c.inc
>> index e651ec5c71..ccc13ffdb4 100644
>> --- a/tcg/loongarch64/tcg-target.c.inc
>> +++ b/tcg/loongarch64/tcg-target.c.inc
>> @@ -30,6 +30,7 @@
>>    */
>>     #include "../tcg-ldst.c.inc"
>> +#include <asm/hwcap.h>
>>     #ifdef CONFIG_DEBUG_TCG
>>   static const char * const tcg_target_reg_names[TCG_TARGET_NB_REGS] = {
>> @@ -1674,6 +1675,11 @@ static void 
>> tcg_target_qemu_prologue(TCGContext *s)
>>     static void tcg_target_init(TCGContext *s)
>>   {
>> +    unsigned long hwcap = qemu_getauxval(AT_HWCAP);
>> +
>> +    /* All server class loongarch have UAL; only embedded do not. */
>> +    assert(hwcap & HWCAP_LOONGARCH_UAL);
>> +
> It is a bit worrying that a future SoC (the octa-core Loongson 2K3000) 
> might get used for light desktop use cases (e.g. laptops) where QEMU 
> is arguably relevant, but it's currently unclear whether its LA364 
> micro-architecture will have UAL. The Loongson folks may have more to 
> share.
'LA364' supports UAL.

Thanks.
Song Gao



^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 03/57] accel/tcg: Introduce tlb_read_idx
  2023-05-05 18:57     ` Richard Henderson
@ 2023-05-07 10:09       ` Peter Maydell
  2023-05-08 10:02         ` Richard Henderson
  0 siblings, 1 reply; 132+ messages in thread
From: Peter Maydell @ 2023-05-07 10:09 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv,
	qemu-s390x, Alex Bennée

On Fri, 5 May 2023 at 19:57, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> On 5/4/23 16:02, Peter Maydell wrote:
> > On Wed, 3 May 2023 at 08:15, Richard Henderson
> > <richard.henderson@linaro.org> wrote:
> >>
> >> Instead of playing with offsetof in various places, use
> >> MMUAccessType to index an array.  This is easily defined
> >> instead of the previous dummy padding array in the union.
> >>
> >> Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
> >> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
> >> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> >> ---
> >
> >> @@ -1802,7 +1763,8 @@ static void *atomic_mmu_lookup(CPUArchState *env, target_ulong addr,
> >>       if (prot & PAGE_WRITE) {
> >>           tlb_addr = tlb_addr_write(tlbe);
> >>           if (!tlb_hit(tlb_addr, addr)) {
> >> -            if (!VICTIM_TLB_HIT(addr_write, addr)) {
> >> +            if (!victim_tlb_hit(env, mmu_idx, index, MMU_DATA_STORE,
> >> +                                addr & TARGET_PAGE_MASK)) {
> >>                   tlb_fill(env_cpu(env), addr, size,
> >>                            MMU_DATA_STORE, mmu_idx, retaddr);
> >>                   index = tlb_index(env, mmu_idx, addr);
> >> @@ -1835,7 +1797,8 @@ static void *atomic_mmu_lookup(CPUArchState *env, target_ulong addr,
> >>       } else /* if (prot & PAGE_READ) */ {
> >>           tlb_addr = tlbe->addr_read;
>
> read
>
> >>           if (!tlb_hit(tlb_addr, addr)) {
> >> -            if (!VICTIM_TLB_HIT(addr_write, addr)) {
>
> write
>
> >> +            if (!victim_tlb_hit(env, mmu_idx, index, MMU_DATA_LOAD,
> >> +                                addr & TARGET_PAGE_MASK)) {
> >
> > This was previously looking at addr_write, but now we pass
> > MMU_DATA_LOAD ?
>
> This is the read portion of the read+write check.  The write portion is above in the
> previous hunk.  So this is an existing error, fixed here, and I hadn't noticed.

Yeah, I wondered if it was a pre-existing bug. We should split out
the bug fix.

-- PMM


^ permalink raw reply	[flat|nested] 132+ messages in thread

* Re: [PATCH v4 03/57] accel/tcg: Introduce tlb_read_idx
  2023-05-07 10:09       ` Peter Maydell
@ 2023-05-08 10:02         ` Richard Henderson
  0 siblings, 0 replies; 132+ messages in thread
From: Richard Henderson @ 2023-05-08 10:02 UTC (permalink / raw)
  To: Peter Maydell
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv,
	qemu-s390x, Alex Bennée

On 5/7/23 11:09, Peter Maydell wrote:
> On Fri, 5 May 2023 at 19:57, Richard Henderson
> <richard.henderson@linaro.org> wrote:
>>
>> On 5/4/23 16:02, Peter Maydell wrote:
>>> On Wed, 3 May 2023 at 08:15, Richard Henderson
>>> <richard.henderson@linaro.org> wrote:
>>>>
>>>> Instead of playing with offsetof in various places, use
>>>> MMUAccessType to index an array.  This is easily defined
>>>> instead of the previous dummy padding array in the union.
>>>>
>>>> Reviewed-by: Alex Bennée <alex.bennee@linaro.org>
>>>> Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
>>>> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
>>>> ---
>>>
>>>> @@ -1802,7 +1763,8 @@ static void *atomic_mmu_lookup(CPUArchState *env, target_ulong addr,
>>>>        if (prot & PAGE_WRITE) {
>>>>            tlb_addr = tlb_addr_write(tlbe);
>>>>            if (!tlb_hit(tlb_addr, addr)) {
>>>> -            if (!VICTIM_TLB_HIT(addr_write, addr)) {
>>>> +            if (!victim_tlb_hit(env, mmu_idx, index, MMU_DATA_STORE,
>>>> +                                addr & TARGET_PAGE_MASK)) {
>>>>                    tlb_fill(env_cpu(env), addr, size,
>>>>                             MMU_DATA_STORE, mmu_idx, retaddr);
>>>>                    index = tlb_index(env, mmu_idx, addr);
>>>> @@ -1835,7 +1797,8 @@ static void *atomic_mmu_lookup(CPUArchState *env, target_ulong addr,
>>>>        } else /* if (prot & PAGE_READ) */ {
>>>>            tlb_addr = tlbe->addr_read;
>>
>> read
>>
>>>>            if (!tlb_hit(tlb_addr, addr)) {
>>>> -            if (!VICTIM_TLB_HIT(addr_write, addr)) {
>>
>> write
>>
>>>> +            if (!victim_tlb_hit(env, mmu_idx, index, MMU_DATA_LOAD,
>>>> +                                addr & TARGET_PAGE_MASK)) {
>>>
>>> This was previously looking at addr_write, but now we pass
>>> MMU_DATA_LOAD ?
>>
>> This is the read portion of the read+write check.  The write portion is above in the
>> previous hunk.  So this is an existing error, fixed here, and I hadn't noticed.
> 
> Yeah, I wondered if it was a pre-existing bug. We should split out
> the bug fix.

https://patchew.org/QEMU/20230505204049.352469-1-richard.henderson@linaro.org/

r~




* Re: [PATCH v4 07/57] accel/tcg: Honor atomicity of stores
  2023-05-05  9:28   ` Peter Maydell
@ 2023-05-08 10:11     ` Richard Henderson
  0 siblings, 0 replies; 132+ messages in thread
From: Richard Henderson @ 2023-05-08 10:11 UTC (permalink / raw)
  To: Peter Maydell
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On 5/5/23 10:28, Peter Maydell wrote:
> On Wed, 3 May 2023 at 08:11, Richard Henderson
> <richard.henderson@linaro.org> wrote:
>>
>> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
>> ---
>>   accel/tcg/cputlb.c             | 103 +++----
>>   accel/tcg/user-exec.c          |  12 +-
>>   accel/tcg/ldst_atomicity.c.inc | 491 +++++++++++++++++++++++++++++++++
>>   3 files changed, 540 insertions(+), 66 deletions(-)
> 
> 
>> +/**
>> + * store_atom_insert_al16:
>> + * @p: host address
>> + * @val: shifted value to store
>> + * @msk: mask for value to store
>> + *
>> + * Atomically store @val to @p masked by @msk.
>> + */
>> +static void store_atom_insert_al16(Int128 *ps, Int128Alias val, Int128Alias msk)
>> +{
>> +#if defined(CONFIG_ATOMIC128)
>> +    __uint128_t *pu, old, new;
>> +
>> +    /* With CONFIG_ATOMIC128, we can avoid the memory barriers. */
>> +    pu = __builtin_assume_aligned(ps, 16);
>> +    old = *pu;
>> +    do {
>> +        new = (old & ~msk.u) | val.u;
>> +    } while (!__atomic_compare_exchange_n(pu, &old, new, true,
>> +                                          __ATOMIC_RELAXED, __ATOMIC_RELAXED));
>> +#elif defined(CONFIG_CMPXCHG128)
>> +    __uint128_t *pu, old, new;
>> +
>> +    /*
>> +     * Without CONFIG_ATOMIC128, __atomic_compare_exchange_n will always
>> +     * defer to libatomic, so we must use __sync_val_compare_and_swap_16
>> +     * and accept the sequential consistency that comes with it.
>> +     */
>> +    pu = __builtin_assume_aligned(ps, 16);
>> +    do {
>> +        old = *pu;
>> +        new = (old & ~msk.u) | val.u;
>> +    } while (!__sync_bool_compare_and_swap_16(pu, old, new));
> 
> Comment says "__sync_val..." but code says "__sync_bool...". Which is right?

Fixed the comment to __sync_*_compare_and_swap_16


r~



* Re: [PATCH v4 56/57] tcg/ppc: Support 128-bit load/store
  2023-05-03  7:06 ` [PATCH v4 56/57] tcg/ppc: " Richard Henderson
@ 2023-05-08 12:16   ` Daniel Henrique Barboza
  0 siblings, 0 replies; 132+ messages in thread
From: Daniel Henrique Barboza @ 2023-05-08 12:16 UTC (permalink / raw)
  To: Richard Henderson, qemu-devel
  Cc: git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x



On 5/3/23 04:06, Richard Henderson wrote:
> Use LQ/STQ with ISA v2.07, and 16-byte atomicity is required.
> Note that these instructions do not require 16-byte alignment.
> 
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---


Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>

>   tcg/ppc/tcg-target-con-set.h |   2 +
>   tcg/ppc/tcg-target-con-str.h |   1 +
>   tcg/ppc/tcg-target.h         |   3 +-
>   tcg/ppc/tcg-target.c.inc     | 173 +++++++++++++++++++++++++++++++----
>   4 files changed, 158 insertions(+), 21 deletions(-)
> 
> diff --git a/tcg/ppc/tcg-target-con-set.h b/tcg/ppc/tcg-target-con-set.h
> index f206b29205..bbd7b21247 100644
> --- a/tcg/ppc/tcg-target-con-set.h
> +++ b/tcg/ppc/tcg-target-con-set.h
> @@ -14,6 +14,7 @@ C_O0_I2(r, r)
>   C_O0_I2(r, ri)
>   C_O0_I2(v, r)
>   C_O0_I3(r, r, r)
> +C_O0_I3(o, m, r)
>   C_O0_I4(r, r, ri, ri)
>   C_O0_I4(r, r, r, r)
>   C_O1_I1(r, r)
> @@ -34,6 +35,7 @@ C_O1_I3(v, v, v, v)
>   C_O1_I4(r, r, ri, rZ, rZ)
>   C_O1_I4(r, r, r, ri, ri)
>   C_O2_I1(r, r, r)
> +C_O2_I1(o, m, r)
>   C_O2_I2(r, r, r, r)
>   C_O2_I4(r, r, rI, rZM, r, r)
>   C_O2_I4(r, r, r, r, rI, rZM)
> diff --git a/tcg/ppc/tcg-target-con-str.h b/tcg/ppc/tcg-target-con-str.h
> index 094613cbcb..20846901de 100644
> --- a/tcg/ppc/tcg-target-con-str.h
> +++ b/tcg/ppc/tcg-target-con-str.h
> @@ -9,6 +9,7 @@
>    * REGS(letter, register_mask)
>    */
>   REGS('r', ALL_GENERAL_REGS)
> +REGS('o', ALL_GENERAL_REGS & 0xAAAAAAAAu)  /* odd registers */
>   REGS('v', ALL_VECTOR_REGS)
>   
>   /*
> diff --git a/tcg/ppc/tcg-target.h b/tcg/ppc/tcg-target.h
> index 0914380bd7..204b70f86a 100644
> --- a/tcg/ppc/tcg-target.h
> +++ b/tcg/ppc/tcg-target.h
> @@ -149,7 +149,8 @@ extern bool have_vsx;
>   #define TCG_TARGET_HAS_mulsh_i64        1
>   #endif
>   
> -#define TCG_TARGET_HAS_qemu_ldst_i128   0
> +#define TCG_TARGET_HAS_qemu_ldst_i128   \
> +    (TCG_TARGET_REG_BITS == 64 && have_isa_2_07)
>   
>   /*
>    * While technically Altivec could support V64, it has no 64-bit store
> diff --git a/tcg/ppc/tcg-target.c.inc b/tcg/ppc/tcg-target.c.inc
> index 60375804cd..682743a466 100644
> --- a/tcg/ppc/tcg-target.c.inc
> +++ b/tcg/ppc/tcg-target.c.inc
> @@ -295,25 +295,27 @@ static bool tcg_target_const_match(int64_t val, TCGType type, int ct)
>   
>   #define B      OPCD( 18)
>   #define BC     OPCD( 16)
> +
>   #define LBZ    OPCD( 34)
>   #define LHZ    OPCD( 40)
>   #define LHA    OPCD( 42)
>   #define LWZ    OPCD( 32)
>   #define LWZUX  XO31( 55)
> -#define STB    OPCD( 38)
> -#define STH    OPCD( 44)
> -#define STW    OPCD( 36)
> -
> -#define STD    XO62(  0)
> -#define STDU   XO62(  1)
> -#define STDX   XO31(149)
> -
>   #define LD     XO58(  0)
>   #define LDX    XO31( 21)
>   #define LDU    XO58(  1)
>   #define LDUX   XO31( 53)
>   #define LWA    XO58(  2)
>   #define LWAX   XO31(341)
> +#define LQ     OPCD( 56)
> +
> +#define STB    OPCD( 38)
> +#define STH    OPCD( 44)
> +#define STW    OPCD( 36)
> +#define STD    XO62(  0)
> +#define STDU   XO62(  1)
> +#define STDX   XO31(149)
> +#define STQ    XO62(  2)
>   
>   #define ADDIC  OPCD( 12)
>   #define ADDI   OPCD( 14)
> @@ -2015,11 +2017,25 @@ static bool tcg_out_qemu_st_slow_path(TCGContext *s, TCGLabelQemuLdst *lb)
>   typedef struct {
>       TCGReg base;
>       TCGReg index;
> +    MemOp align;
> +    MemOp atom;
>   } HostAddress;
>   
>   bool tcg_target_has_memory_bswap(MemOp memop)
>   {
> -    return true;
> +    MemOp atom_a, atom_u;
> +
> +    if ((memop & MO_SIZE) <= MO_64) {
> +        return true;
> +    }
> +
> +    /*
> +     * Reject 16-byte memop with 16-byte atomicity,
> +     * but do allow a pair of 64-bit operations.
> +     */
> +    (void)atom_and_align_for_opc(tcg_ctx, &atom_a, &atom_u, memop,
> +                                 MO_ATOM_IFALIGN, true);
> +    return atom_a <= MO_64;
>   }
>   
>   /*
> @@ -2034,7 +2050,7 @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
>   {
>       TCGLabelQemuLdst *ldst = NULL;
>       MemOp opc = get_memop(oi);
> -    MemOp a_bits, atom_a, atom_u;
> +    MemOp a_bits, atom_u, s_bits;
>   
>       /*
>        * Book II, Section 1.4, Single-Copy Atomicity, specifies:
> @@ -2046,10 +2062,19 @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
>        * As of 3.0, "the non-atomic access is performed as described in
>        * the corresponding list", which matches MO_ATOM_SUBALIGN.
>        */
> -    a_bits = atom_and_align_for_opc(s, &atom_a, &atom_u, opc,
> +    s_bits = opc & MO_SIZE;
> +    a_bits = atom_and_align_for_opc(s, &h->atom, &atom_u, opc,
>                                       have_isa_3_00 ? MO_ATOM_SUBALIGN
>                                                     : MO_ATOM_IFALIGN,
> -                                    false);
> +                                    s_bits == MO_128);
> +
> +    if (TCG_TARGET_REG_BITS == 32) {
> +        /* We don't support unaligned accesses on 32-bits. */
> +        if (a_bits < s_bits) {
> +            a_bits = s_bits;
> +        }
> +    }
> +    h->align = a_bits;
>   
>   #ifdef CONFIG_SOFTMMU
>       int mem_index = get_mmuidx(oi);
> @@ -2058,7 +2083,6 @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
>       int fast_off = TLB_MASK_TABLE_OFS(mem_index);
>       int mask_off = fast_off + offsetof(CPUTLBDescFast, mask);
>       int table_off = fast_off + offsetof(CPUTLBDescFast, table);
> -    unsigned s_bits = opc & MO_SIZE;
>   
>       ldst = new_ldst_label(s);
>       ldst->is_ld = is_ld;
> @@ -2108,13 +2132,6 @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
>   
>       /* Clear the non-page, non-alignment bits from the address in R0. */
>       if (TCG_TARGET_REG_BITS == 32) {
> -        /* We don't support unaligned accesses on 32-bits.
> -         * Preserve the bottom bits and thus trigger a comparison
> -         * failure on unaligned accesses.
> -         */
> -        if (a_bits < s_bits) {
> -            a_bits = s_bits;
> -        }
>           tcg_out_rlw(s, RLWINM, TCG_REG_R0, addrlo, 0,
>                       (32 - a_bits) & 31, 31 - TARGET_PAGE_BITS);
>       } else {
> @@ -2299,6 +2316,108 @@ static void tcg_out_qemu_st(TCGContext *s, TCGReg datalo, TCGReg datahi,
>       }
>   }
>   
> +static TCGLabelQemuLdst *
> +prepare_host_addr_index_only(TCGContext *s, HostAddress *h, TCGReg addr_reg,
> +                             MemOpIdx oi, bool is_ld)
> +{
> +    TCGLabelQemuLdst *ldst;
> +
> +    ldst = prepare_host_addr(s, h, addr_reg, -1, oi, true);
> +
> +    /* Compose the final address, as LQ/STQ have no indexing. */
> +    if (h->base != 0) {
> +        tcg_out32(s, ADD | TAB(TCG_REG_TMP1, h->base, h->index));
> +        h->index = TCG_REG_TMP1;
> +        h->base = 0;
> +    }
> +
> +    return ldst;
> +}
> +
> +static void tcg_out_qemu_ld128(TCGContext *s, TCGReg datalo, TCGReg datahi,
> +                               TCGReg addr_reg, MemOpIdx oi)
> +{
> +    TCGLabelQemuLdst *ldst;
> +    HostAddress h;
> +    bool need_bswap;
> +
> +    ldst = prepare_host_addr_index_only(s, &h, addr_reg, oi, true);
> +    need_bswap = get_memop(oi) & MO_BSWAP;
> +
> +    if (h.atom == MO_128) {
> +        tcg_debug_assert(!need_bswap);
> +        tcg_debug_assert(datalo & 1);
> +        tcg_debug_assert(datahi == datalo - 1);
> +        tcg_out32(s, LQ | TAI(datahi, h.index, 0));
> +    } else {
> +        TCGReg d1, d2;
> +
> +        if (HOST_BIG_ENDIAN ^ need_bswap) {
> +            d1 = datahi, d2 = datalo;
> +        } else {
> +            d1 = datalo, d2 = datahi;
> +        }
> +
> +        if (need_bswap) {
> +            tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_R0, 8);
> +            tcg_out32(s, LDBRX | TAB(d1, 0, h.index));
> +            tcg_out32(s, LDBRX | TAB(d2, h.index, TCG_REG_R0));
> +        } else {
> +            tcg_out32(s, LD | TAI(d1, h.index, 0));
> +            tcg_out32(s, LD | TAI(d2, h.index, 8));
> +        }
> +    }
> +
> +    if (ldst) {
> +        ldst->type = TCG_TYPE_I128;
> +        ldst->datalo_reg = datalo;
> +        ldst->datahi_reg = datahi;
> +        ldst->raddr = tcg_splitwx_to_rx(s->code_ptr);
> +    }
> +}
> +
> +static void tcg_out_qemu_st128(TCGContext *s, TCGReg datalo, TCGReg datahi,
> +                               TCGReg addr_reg, MemOpIdx oi)
> +{
> +    TCGLabelQemuLdst *ldst;
> +    HostAddress h;
> +    bool need_bswap;
> +
> +    ldst = prepare_host_addr_index_only(s, &h, addr_reg, oi, false);
> +    need_bswap = get_memop(oi) & MO_BSWAP;
> +
> +    if (h.atom == MO_128) {
> +        tcg_debug_assert(!need_bswap);
> +        tcg_debug_assert(datalo & 1);
> +        tcg_debug_assert(datahi == datalo - 1);
> +        tcg_out32(s, STQ | TAI(datahi, h.index, 0));
> +    } else {
> +        TCGReg d1, d2;
> +
> +        if (HOST_BIG_ENDIAN ^ need_bswap) {
> +            d1 = datahi, d2 = datalo;
> +        } else {
> +            d1 = datalo, d2 = datahi;
> +        }
> +
> +        if (need_bswap) {
> +            tcg_out_movi(s, TCG_TYPE_PTR, TCG_REG_R0, 8);
> +            tcg_out32(s, STDBRX | TAB(d1, 0, h.index));
> +            tcg_out32(s, STDBRX | TAB(d2, h.index, TCG_REG_R0));
> +        } else {
> +            tcg_out32(s, STD | TAI(d1, h.index, 0));
> +            tcg_out32(s, STD | TAI(d2, h.index, 8));
> +        }
> +    }
> +
> +    if (ldst) {
> +        ldst->type = TCG_TYPE_I128;
> +        ldst->datalo_reg = datalo;
> +        ldst->datahi_reg = datahi;
> +        ldst->raddr = tcg_splitwx_to_rx(s->code_ptr);
> +    }
> +}
> +
>   static void tcg_out_nop_fill(tcg_insn_unit *p, int count)
>   {
>       int i;
> @@ -2849,6 +2968,11 @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc,
>                               args[4], TCG_TYPE_I64);
>           }
>           break;
> +    case INDEX_op_qemu_ld_i128:
> +        tcg_debug_assert(TCG_TARGET_REG_BITS == 64);
> +        tcg_out_qemu_ld128(s, args[0], args[1], args[2], args[3]);
> +        break;
> +
>       case INDEX_op_qemu_st_i32:
>           if (TCG_TARGET_REG_BITS >= TARGET_LONG_BITS) {
>               tcg_out_qemu_st(s, args[0], -1, args[1], -1,
> @@ -2870,6 +2994,10 @@ static void tcg_out_op(TCGContext *s, TCGOpcode opc,
>                               args[4], TCG_TYPE_I64);
>           }
>           break;
> +    case INDEX_op_qemu_st_i128:
> +        tcg_debug_assert(TCG_TARGET_REG_BITS == 64);
> +        tcg_out_qemu_st128(s, args[0], args[1], args[2], args[3]);
> +        break;
>   
>       case INDEX_op_setcond_i32:
>           tcg_out_setcond(s, TCG_TYPE_I32, args[3], args[0], args[1], args[2],
> @@ -3705,6 +3833,11 @@ static TCGConstraintSetIndex tcg_target_op_def(TCGOpcode op)
>                   : TARGET_LONG_BITS == 32 ? C_O0_I3(r, r, r)
>                   : C_O0_I4(r, r, r, r));
>   
> +    case INDEX_op_qemu_ld_i128:
> +        return C_O2_I1(o, m, r);
> +    case INDEX_op_qemu_st_i128:
> +        return C_O0_I3(o, m, r);
> +
>       case INDEX_op_add_vec:
>       case INDEX_op_sub_vec:
>       case INDEX_op_mul_vec:



* Re: [PATCH v4 14/57] tcg/i386: Add have_atomic16
  2023-05-05 10:34   ` Peter Maydell
@ 2023-05-08 13:41     ` Richard Henderson
  0 siblings, 0 replies; 132+ messages in thread
From: Richard Henderson @ 2023-05-08 13:41 UTC (permalink / raw)
  To: Peter Maydell
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On 5/5/23 11:34, Peter Maydell wrote:
> On Wed, 3 May 2023 at 08:10, Richard Henderson
> <richard.henderson@linaro.org> wrote:
>>
>> Notice when Intel or AMD have guaranteed that vmovdqa is atomic.
>> The new variable will also be used in generated code.
>>
>> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
>> ---
>>   include/qemu/cpuid.h      | 18 ++++++++++++++++++
>>   tcg/i386/tcg-target.h     |  1 +
>>   tcg/i386/tcg-target.c.inc | 27 +++++++++++++++++++++++++++
>>   3 files changed, 46 insertions(+)
>>
>> diff --git a/include/qemu/cpuid.h b/include/qemu/cpuid.h
>> index 1451e8ef2f..35325f1995 100644
>> --- a/include/qemu/cpuid.h
>> +++ b/include/qemu/cpuid.h
>> @@ -71,6 +71,24 @@
>>   #define bit_LZCNT       (1 << 5)
>>   #endif
>>
>> +/*
>> + * Signatures for different CPU implementations as returned from Leaf 0.
>> + */
>> +
>> +#ifndef signature_INTEL_ecx
>> +/* "Genu" "ineI" "ntel" */
>> +#define signature_INTEL_ebx     0x756e6547
>> +#define signature_INTEL_edx     0x49656e69
>> +#define signature_INTEL_ecx     0x6c65746e
>> +#endif
>> +
>> +#ifndef signature_AMD_ecx
>> +/* "Auth" "enti" "cAMD" */
>> +#define signature_AMD_ebx       0x68747541
>> +#define signature_AMD_edx       0x69746e65
>> +#define signature_AMD_ecx       0x444d4163
>> +#endif
> 
>> @@ -4024,6 +4025,32 @@ static void tcg_target_init(TCGContext *s)
>>                       have_avx512dq = (b7 & bit_AVX512DQ) != 0;
>>                       have_avx512vbmi2 = (c7 & bit_AVX512VBMI2) != 0;
>>                   }
>> +
>> +                /*
>> +                 * The Intel SDM has added:
>> +                 *   Processors that enumerate support for Intel® AVX
>> +                 *   (by setting the feature flag CPUID.01H:ECX.AVX[bit 28])
>> +                 *   guarantee that the 16-byte memory operations performed
>> +                 *   by the following instructions will always be carried
>> +                 *   out atomically:
>> +                 *   - MOVAPD, MOVAPS, and MOVDQA.
>> +                 *   - VMOVAPD, VMOVAPS, and VMOVDQA when encoded with VEX.128.
>> +                 *   - VMOVAPD, VMOVAPS, VMOVDQA32, and VMOVDQA64 when encoded
>> +                 *     with EVEX.128 and k0 (masking disabled).
>> +                 * Note that these instructions require the linear addresses
>> +                 * of their memory operands to be 16-byte aligned.
>> +                 *
>> +                 * AMD has provided an even stronger guarantee that processors
>> +                 * with AVX provide 16-byte atomicity for all cachable,
>> +                 * naturally aligned single loads and stores, e.g. MOVDQU.
>> +                 *
>> +                 * See https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688
>> +                 */
>> +                if (have_avx1) {
>> +                    __cpuid(0, a, b, c, d);
>> +                    have_atomic16 = (c == signature_INTEL_ecx ||
>> +                                     c == signature_AMD_ecx);
>> +                }
> 
> If the signature is 3 words why are we only checking one here ?

Because one is sufficient.  I don't know why the signature is 3 words and not 1.


r~
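
For comparison, matching all three signature words would look roughly like
the sketch below. It is illustrative only: it reuses the __cpuid helper,
the signature_* macros and the have_avx1/have_atomic16 variables from the
hunks quoted above, and as Richard notes, checking ECX alone is already
sufficient to distinguish the two vendors.

    if (have_avx1) {
        unsigned a, b, c, d;
        bool is_intel, is_amd;

        __cpuid(0, a, b, c, d);
        is_intel = b == signature_INTEL_ebx &&
                   d == signature_INTEL_edx &&
                   c == signature_INTEL_ecx;
        is_amd   = b == signature_AMD_ebx &&
                   d == signature_AMD_edx &&
                   c == signature_AMD_ecx;
        have_atomic16 = is_intel || is_amd;
    }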



* Re: [PATCH v4 15/57] accel/tcg: Use have_atomic16 in ldst_atomicity.c.inc
  2023-05-05 10:37   ` Peter Maydell
@ 2023-05-08 13:48     ` Richard Henderson
  0 siblings, 0 replies; 132+ messages in thread
From: Richard Henderson @ 2023-05-08 13:48 UTC (permalink / raw)
  To: Peter Maydell
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On 5/5/23 11:37, Peter Maydell wrote:
> On Wed, 3 May 2023 at 08:08, Richard Henderson
> <richard.henderson@linaro.org> wrote:
>>
>> Hosts using Intel and AMD AVX cpus are quite common.
>> Add fast paths through ldst_atomicity using this.
>>
>> Only enable with CONFIG_INT128; some older clang versions do not
>> support __int128_t, and the inline assembly won't work on structures.
>>
>> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
>> ---
>>   accel/tcg/ldst_atomicity.c.inc | 76 +++++++++++++++++++++++++++-------
>>   1 file changed, 60 insertions(+), 16 deletions(-)
> 
> Inline x86 asm in a bit of generic code seems rather awkward.
> Ideally the compiler should be doing this for us; failing
> that can we at least abstract out the operations to a
> set of functions that we can provide (or not provide)
> implementations of?

This was the best I could come up with, because of the fall through into other generic 
code.  Specific suggestions welcome.


r~
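
One hypothetical shape for such an abstraction, purely as a sketch (the
header path, the HAVE_LOAD_ATOMIC16 macro and the function name are all
invented here; Int128 and qemu_build_not_reached() are existing QEMU
helpers): each host that can do a fast 16-byte atomic load supplies its
own copy of a small header, and the generic fallback simply opts out so
that ldst_atomicity.c.inc falls through to the portable paths.

    /* host/include/generic/load-atomic16.h -- invented path */
    #ifndef HOST_LOAD_ATOMIC16_H
    #define HOST_LOAD_ATOMIC16_H

    /* Generic fallback: no host-specific 16-byte atomic load. */
    #define HAVE_LOAD_ATOMIC16  0

    static inline Int128 load_atomic16_host(const void *pv)
    {
        /* Only reachable if HAVE_LOAD_ATOMIC16 were misdefined. */
        qemu_build_not_reached();
    }

    #endif

An x86-64 copy of the header would then set HAVE_LOAD_ATOMIC16 from the
have_atomic16 check of patch 14 and keep the VMOVDQA inline assembly
there, out of the generic code.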




* Re: [PATCH v4 26/57] tcg/arm: Adjust constraints on qemu_ld/st
  2023-05-05 12:14   ` Peter Maydell
@ 2023-05-08 15:13     ` Richard Henderson
  0 siblings, 0 replies; 132+ messages in thread
From: Richard Henderson @ 2023-05-08 15:13 UTC (permalink / raw)
  To: Peter Maydell
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On 5/5/23 13:14, Peter Maydell wrote:
> On Wed, 3 May 2023 at 08:10, Richard Henderson
> <richard.henderson@linaro.org> wrote:
>>
>> Always reserve r3 for tlb softmmu lookup.  Fix a bug in user-only
>> ALL_QLDST_REGS, in that r14 is clobbered by the BLNE that leads
>> to the misaligned trap.
>>
>> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
>> ---
> 
>>   /*
>> - * r0-r2 will be overwritten when reading the tlb entry (softmmu only)
>> - * and r0-r1 doing the byte swapping, so don't use these.
>> - * r3 is removed for softmmu to avoid clashes with helper arguments.
>> + * r0-r3 will be overwritten when reading the tlb entry (softmmu only);
>> + * r14 will be overwritten by the BLNE branching to the slow path.
>>    */
>>   #ifdef CONFIG_SOFTMMU
>> -#define ALL_QLOAD_REGS \
>> +#define ALL_QLDST_REGS \
>>       (ALL_GENERAL_REGS & ~((1 << TCG_REG_R0) | (1 << TCG_REG_R1) | \
>>                             (1 << TCG_REG_R2) | (1 << TCG_REG_R3) | \
>>                             (1 << TCG_REG_R14)))
>> -#define ALL_QSTORE_REGS \
>> -    (ALL_GENERAL_REGS & ~((1 << TCG_REG_R0) | (1 << TCG_REG_R1) | \
>> -                          (1 << TCG_REG_R2) | (1 << TCG_REG_R14) | \
>> -                          ((TARGET_LONG_BITS == 64) << TCG_REG_R3)))
>>   #else
>> -#define ALL_QLOAD_REGS   ALL_GENERAL_REGS
>> -#define ALL_QSTORE_REGS \
>> -    (ALL_GENERAL_REGS & ~((1 << TCG_REG_R0) | (1 << TCG_REG_R1)))
>> +#define ALL_QLDST_REGS   (ALL_GENERAL_REGS & ~(1 << TCG_REG_R14))
>>   #endif
> 
> Why is it OK not to remove r0 and r1 from this any more ?
> The commit message doesn't say anything about this bit of the change.

I'm not 100% sure why they were included.  Perhaps bswap, from the old days where that was 
required of the backend.


r~




* Re: [PATCH v4 30/57] tcg/sparc64: Allocate %g2 as a third temporary
  2023-05-05 12:19   ` Peter Maydell
@ 2023-05-08 15:17     ` Richard Henderson
  2023-05-09  9:24       ` Peter Maydell
  0 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-08 15:17 UTC (permalink / raw)
  To: Peter Maydell
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On 5/5/23 13:19, Peter Maydell wrote:
> On Wed, 3 May 2023 at 08:17, Richard Henderson
> <richard.henderson@linaro.org> wrote:
>>
>> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
>> ---
>>   tcg/sparc64/tcg-target.c.inc | 15 +++++++--------
>>   1 file changed, 7 insertions(+), 8 deletions(-)
>>
>> diff --git a/tcg/sparc64/tcg-target.c.inc b/tcg/sparc64/tcg-target.c.inc
>> index e997db2645..64464ab363 100644
>> --- a/tcg/sparc64/tcg-target.c.inc
>> +++ b/tcg/sparc64/tcg-target.c.inc
>> @@ -83,9 +83,10 @@ static const char * const tcg_target_reg_names[TCG_TARGET_NB_REGS] = {
>>   #define ALL_GENERAL_REGS     MAKE_64BIT_MASK(0, 32)
>>   #define ALL_QLDST_REGS       (ALL_GENERAL_REGS & ~SOFTMMU_RESERVE_REGS)
>>
>> -/* Define some temporary registers.  T2 is used for constant generation.  */
>> +/* Define some temporary registers.  T3 is used for constant generation.  */
>>   #define TCG_REG_T1  TCG_REG_G1
>> -#define TCG_REG_T2  TCG_REG_O7
>> +#define TCG_REG_T2  TCG_REG_G2
>> +#define TCG_REG_T3  TCG_REG_O7
>>
>>   #ifndef CONFIG_SOFTMMU
>>   # define TCG_GUEST_BASE_REG TCG_REG_I5
>> @@ -110,7 +111,6 @@ static const int tcg_target_reg_alloc_order[] = {
>>       TCG_REG_I4,
>>       TCG_REG_I5,
>>
>> -    TCG_REG_G2,
>>       TCG_REG_G3,
>>       TCG_REG_G4,
>>       TCG_REG_G5,
>> @@ -492,8 +492,8 @@ static void tcg_out_movi_int(TCGContext *s, TCGType type, TCGReg ret,
>>   static void tcg_out_movi(TCGContext *s, TCGType type,
>>                            TCGReg ret, tcg_target_long arg)
>>   {
>> -    tcg_debug_assert(ret != TCG_REG_T2);
>> -    tcg_out_movi_int(s, type, ret, arg, false, TCG_REG_T2);
>> +    tcg_debug_assert(ret != TCG_REG_T3);
>> +    tcg_out_movi_int(s, type, ret, arg, false, TCG_REG_T3);
>>   }
> 
> Why do we need to change this usage of TCG_REG_T2 but not
> any of the others ?

To match the comment above.


r~




* Re: [PATCH v4 52/57] tcg/i386: Honor 64-bit atomicity in 32-bit mode
  2023-05-05 13:27   ` Peter Maydell
@ 2023-05-08 16:15     ` Richard Henderson
  0 siblings, 0 replies; 132+ messages in thread
From: Richard Henderson @ 2023-05-08 16:15 UTC (permalink / raw)
  To: Peter Maydell
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On 5/5/23 14:27, Peter Maydell wrote:
> On Wed, 3 May 2023 at 08:18, Richard Henderson
> <richard.henderson@linaro.org> wrote:
>>
>> Use the fpu to perform 64-bit loads and stores.
>>
>> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> 
> 
>> @@ -2091,7 +2095,20 @@ static void tcg_out_qemu_ld_direct(TCGContext *s, TCGReg datalo, TCGReg datahi,
>>               datalo = datahi;
>>               datahi = t;
>>           }
>> -        if (h.base == datalo || h.index == datalo) {
>> +        if (h.atom == MO_64) {
>> +            /*
>> +             * Atomicity requires that we use a single 8-byte load.
>> +             * For simplicity and code size, always use the FPU for this.
>> +             * Similar insns using SSE/AVX are merely larger.
> 
> I'm surprised there's no performance penalty for throwing old-school
> FPU insns into what is presumably otherwise code that's only
> using modern SSE.

I have no idea about performance.  We don't require SSE for TCG at the moment.

> I assume the caller has arranged that the top of the stack
> is trashable at this point?

The entire fpu stack is call-clobbered.


r~




* Re: [PATCH v4 48/57] tcg/ppc: Use atom_and_align_for_opc
  2023-05-05 13:18   ` Peter Maydell
@ 2023-05-08 17:32     ` Richard Henderson
  0 siblings, 0 replies; 132+ messages in thread
From: Richard Henderson @ 2023-05-08 17:32 UTC (permalink / raw)
  To: Peter Maydell
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On 5/5/23 14:18, Peter Maydell wrote:
> On Wed, 3 May 2023 at 08:13, Richard Henderson
> <richard.henderson@linaro.org> wrote:
>>
>> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
>> ---
>>   tcg/ppc/tcg-target.c.inc | 17 ++++++++++++++++-
>>   1 file changed, 16 insertions(+), 1 deletion(-)
>>
>> diff --git a/tcg/ppc/tcg-target.c.inc b/tcg/ppc/tcg-target.c.inc
>> index f0a4118bbb..60375804cd 100644
>> --- a/tcg/ppc/tcg-target.c.inc
>> +++ b/tcg/ppc/tcg-target.c.inc
>> @@ -2034,7 +2034,22 @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
>>   {
>>       TCGLabelQemuLdst *ldst = NULL;
>>       MemOp opc = get_memop(oi);
>> -    unsigned a_bits = get_alignment_bits(opc);
>> +    MemOp a_bits, atom_a, atom_u;
>> +
>> +    /*
>> +     * Book II, Section 1.4, Single-Copy Atomicity, specifies:
>> +     *
>> +     * Before 3.0, "An access that is not atomic is performed as a set of
>> +     * smaller disjoint atomic accesses. In general, the number and alignment
>> +     * of these accesses are implementation-dependent."  Thus MO_ATOM_IFALIGN.
>> +     *
>> +     * As of 3.0, "the non-atomic access is performed as described in
>> +     * the corresponding list", which matches MO_ATOM_SUBALIGN.
>> +     */
>> +    a_bits = atom_and_align_for_opc(s, &atom_a, &atom_u, opc,
>> +                                    have_isa_3_00 ? MO_ATOM_SUBALIGN
>> +                                                  : MO_ATOM_IFALIGN,
>> +                                    false);
>>
> 
> Why doesn't this patch have changes to a HostAddress struct
> like all the other archs ?

Because the alignment is only required here, within prepare_host_addr.
The Power LQ instruction allows unaligned input, unlike x86 VMOVDQA.


r~




* Re: [PATCH v4 49/57] tcg/riscv: Use atom_and_align_for_opc
  2023-05-05 13:19   ` Peter Maydell
@ 2023-05-08 17:33     ` Richard Henderson
  0 siblings, 0 replies; 132+ messages in thread
From: Richard Henderson @ 2023-05-08 17:33 UTC (permalink / raw)
  To: Peter Maydell
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On 5/5/23 14:19, Peter Maydell wrote:
> On Wed, 3 May 2023 at 08:13, Richard Henderson
> <richard.henderson@linaro.org> wrote:
>>
>> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
>> ---
>>   tcg/riscv/tcg-target.c.inc | 8 ++++++--
>>   1 file changed, 6 insertions(+), 2 deletions(-)
>>
>> diff --git a/tcg/riscv/tcg-target.c.inc b/tcg/riscv/tcg-target.c.inc
>> index 37870c89fc..4dd33c73e8 100644
>> --- a/tcg/riscv/tcg-target.c.inc
>> +++ b/tcg/riscv/tcg-target.c.inc
>> @@ -910,8 +910,12 @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, TCGReg *pbase,
>>   {
>>       TCGLabelQemuLdst *ldst = NULL;
>>       MemOp opc = get_memop(oi);
>> -    unsigned a_bits = get_alignment_bits(opc);
>> -    unsigned a_mask = (1u << a_bits) - 1;
>> +    MemOp a_bits, atom_a, atom_u;
>> +    unsigned a_mask;
>> +
>> +    a_bits = atom_and_align_for_opc(s, &atom_a, &atom_u, opc,
>> +                                    MO_ATOM_IFALIGN, false);
>> +    a_mask = (1u << a_bits) - 1;
>>
>>   #ifdef CONFIG_SOFTMMU
>>       unsigned s_bits = opc & MO_SIZE;
> 
> Same remark as for ppc.

Because the alignment was not required outside of prepare_host_addr.
RISC-V does not have 128-bit memory operations of any kind.


r~




* Re: [PATCH v4 51/57] tcg/sparc64: Use atom_and_align_for_opc
  2023-05-05 13:20   ` Peter Maydell
@ 2023-05-08 17:34     ` Richard Henderson
  0 siblings, 0 replies; 132+ messages in thread
From: Richard Henderson @ 2023-05-08 17:34 UTC (permalink / raw)
  To: Peter Maydell
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On 5/5/23 14:20, Peter Maydell wrote:
> On Wed, 3 May 2023 at 08:13, Richard Henderson
> <richard.henderson@linaro.org> wrote:
>>
>> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
>> ---
>>   tcg/sparc64/tcg-target.c.inc | 6 ++++--
>>   1 file changed, 4 insertions(+), 2 deletions(-)
>>
>> diff --git a/tcg/sparc64/tcg-target.c.inc b/tcg/sparc64/tcg-target.c.inc
>> index bb23038529..4f9ec02b1f 100644
>> --- a/tcg/sparc64/tcg-target.c.inc
>> +++ b/tcg/sparc64/tcg-target.c.inc
>> @@ -1028,11 +1028,13 @@ static TCGLabelQemuLdst *prepare_host_addr(TCGContext *s, HostAddress *h,
>>   {
>>       TCGLabelQemuLdst *ldst = NULL;
>>       MemOp opc = get_memop(oi);
>> -    unsigned a_bits = get_alignment_bits(opc);
>> -    unsigned s_bits = opc & MO_SIZE;
>> +    MemOp s_bits = opc & MO_SIZE;
>> +    MemOp a_bits, atom_a, atom_u;
>>       unsigned a_mask;
>>
>>       /* We don't support unaligned accesses. */
>> +    a_bits = atom_and_align_for_opc(s, &atom_a, &atom_u, opc,
>> +                                    MO_ATOM_IFALIGN, false);
>>       a_bits = MAX(a_bits, s_bits);
>>       a_mask = (1u << a_bits) - 1;
>>
>> --
> 
> No changes to HostAddress struct again?

Again, no use of alignment outside of prepare_host_addr.
No 128-bit operations, and all host operations aligned.


r~




* Re: [PATCH v4 30/57] tcg/sparc64: Allocate %g2 as a third temporary
  2023-05-08 15:17     ` Richard Henderson
@ 2023-05-09  9:24       ` Peter Maydell
  2023-05-09 14:34         ` Richard Henderson
  0 siblings, 1 reply; 132+ messages in thread
From: Peter Maydell @ 2023-05-09  9:24 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On Mon, 8 May 2023 at 16:17, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> On 5/5/23 13:19, Peter Maydell wrote:
> > On Wed, 3 May 2023 at 08:17, Richard Henderson
> > <richard.henderson@linaro.org> wrote:
> >>
> >> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> >> ---
> >>   tcg/sparc64/tcg-target.c.inc | 15 +++++++--------
> >>   1 file changed, 7 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/tcg/sparc64/tcg-target.c.inc b/tcg/sparc64/tcg-target.c.inc
> >> index e997db2645..64464ab363 100644
> >> --- a/tcg/sparc64/tcg-target.c.inc
> >> +++ b/tcg/sparc64/tcg-target.c.inc
> >> @@ -83,9 +83,10 @@ static const char * const tcg_target_reg_names[TCG_TARGET_NB_REGS] = {
> >>   #define ALL_GENERAL_REGS     MAKE_64BIT_MASK(0, 32)
> >>   #define ALL_QLDST_REGS       (ALL_GENERAL_REGS & ~SOFTMMU_RESERVE_REGS)
> >>
> >> -/* Define some temporary registers.  T2 is used for constant generation.  */
> >> +/* Define some temporary registers.  T3 is used for constant generation.  */
> >>   #define TCG_REG_T1  TCG_REG_G1
> >> -#define TCG_REG_T2  TCG_REG_O7
> >> +#define TCG_REG_T2  TCG_REG_G2
> >> +#define TCG_REG_T3  TCG_REG_O7
> >>
> >>   #ifndef CONFIG_SOFTMMU
> >>   # define TCG_GUEST_BASE_REG TCG_REG_I5
> >> @@ -110,7 +111,6 @@ static const int tcg_target_reg_alloc_order[] = {
> >>       TCG_REG_I4,
> >>       TCG_REG_I5,
> >>
> >> -    TCG_REG_G2,
> >>       TCG_REG_G3,
> >>       TCG_REG_G4,
> >>       TCG_REG_G5,
> >> @@ -492,8 +492,8 @@ static void tcg_out_movi_int(TCGContext *s, TCGType type, TCGReg ret,
> >>   static void tcg_out_movi(TCGContext *s, TCGType type,
> >>                            TCGReg ret, tcg_target_long arg)
> >>   {
> >> -    tcg_debug_assert(ret != TCG_REG_T2);
> >> -    tcg_out_movi_int(s, type, ret, arg, false, TCG_REG_T2);
> >> +    tcg_debug_assert(ret != TCG_REG_T3);
> >> +    tcg_out_movi_int(s, type, ret, arg, false, TCG_REG_T3);
> >>   }
> >
> > Why do we need to change this usage of TCG_REG_T2 but not
> > any of the others ?
>
> To match the comment above.

To expand, what I mean is "when I'm reviewing this patch, what
do I need to know in order to know whether any particular
instance of TCG_REG_T2 should be changed to _T3 or not?".
All the sites where we *don't* change T2 to T3 are now
using a different register, so there is presumably some
logic for how we tell whether that's safe or not. The
"no behaviour change" option would be to change all of them.

thanks
-- PMM



* Re: [PATCH v4 06/57] accel/tcg: Honor atomicity of loads
  2023-05-05 20:19     ` Richard Henderson
@ 2023-05-09 12:04       ` Peter Maydell
  2023-05-09 14:27         ` Richard Henderson
  0 siblings, 1 reply; 132+ messages in thread
From: Peter Maydell @ 2023-05-09 12:04 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv,
	qemu-s390x, Alex Bennée

On Fri, 5 May 2023 at 21:19, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> On 5/4/23 18:17, Peter Maydell wrote:
> >> +    case MO_ATOM_WITHIN16:
> >> +        tmp = p & 15;
> >> +        if (tmp + (1 << size) <= 16) {
> >> +            atmax = size;
> >
> > OK, so this is "whole operation is within 16 bytes,
> > whole operation must be atomic"...
> >
> >> +        } else if (atmax == size) {
> >> +            return MO_8;
> >
> > ...but I don't understand the interaction of WITHIN16
> > and also specifying an ATMAX value that's not ATMAX_SIZE.
>
> I'm trying to describe e.g. LDP, which if not within16 has two 8-byte elements, one or
> both of which must be atomic.  We will have set MO_ATOM_WITHIN16 | MO_ATMAX_8.
>
> If atmax == size, there is only one element, and since it is not within16, there is no
> atomicity.
>
> >> +        } else if (tmp + (1 << atmax) != 16) {
> >
> > Why is this doing an exact inequality check?
> > What if you're asking for a load of 8 bytes at
> > MO_ATMAX_2 from a pointer that's at an offset of
> > 10 bytes from a 16-byte boundary? Then tmp is 10,
> > tmp + (1 << atmax) is 12, but we could still do the
> > loads at atomicity 2. This doesn't seem to me to be
> > any different from the case it does catch where
> > the first ATMAX_2-sized unit happens to be the only
> > thing in this 16-byte block.
>
> If the LDP is aligned mod 8, but not aligned mod 16, then both 8-byte operations must be
> (separately) atomic, and we return MO_64.

So there's an implicit "at most 2 atomic sub-operations
inside a WITHIN16 load" restriction? i.e. you can't
use WITHIN16 to say "do this 8 byte load atomically but
if it's not in a 16-byte region do it with 4 2-byte loads",
even though in theory MO_ATOM_WITHIN16 | MO_ATMAX_2 | MO_8
would describe that ?

thanks
-- PMM



* Re: [PATCH v4 06/57] accel/tcg: Honor atomicity of loads
  2023-05-09 12:04       ` Peter Maydell
@ 2023-05-09 14:27         ` Richard Henderson
  2023-05-09 14:33           ` Peter Maydell
  0 siblings, 1 reply; 132+ messages in thread
From: Richard Henderson @ 2023-05-09 14:27 UTC (permalink / raw)
  To: Peter Maydell
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv,
	qemu-s390x, Alex Bennée

On 5/9/23 13:04, Peter Maydell wrote:
>> If the LDP is aligned mod 8, but not aligned mod 16, then both 8-byte operations must be
>> (separately) atomic, and we return MO_64.
> 
> So there's an implicit "at most 2 atomic sub-operations
> inside a WITHIN16 load" restriction? i.e. you can't
> use WITHIN16 to say "do this 8 byte load atomically but
> if it's not in a 16-byte region do it with 4 2-byte loads",
> even though in theory MO_ATOM_WITHIN16 | MO_ATMAX_2 | MO_8
> would describe that ?

Correct on both counts.  While you're right that this is a valid generalization, it's not 
something for which I've found a use case.


r~
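
Restated as code, the LDP example works out as in the sketch below. This
is only an illustration of the quoted switch for size = MO_128 and
atmax = MO_64, i.e. MO_128 | MO_ATOM_WITHIN16 | MO_ATMAX_8; it is not the
helper itself, and it assumes QEMU's MemOp definitions.

    static MemOp ldp_within16_example(uintptr_t p)
    {
        unsigned tmp = p & 15;        /* offset within the 16-byte block */

        if (tmp + 16 <= 16) {         /* p % 16 == 0 */
            /* Whole 16-byte access in one block: fully atomic. */
            return MO_128;
        }
        if (tmp + 8 == 16) {          /* p % 16 == 8 */
            /* Each 8-byte half sits in its own 16-byte block, so
             * both halves must be individually atomic. */
            return MO_64;
        }
        /* Other offsets: only one of the two halves fits within a
         * 16-byte block; that is the case handled by the truncated
         * "!= 16" branch in the quoted hunk. */
        g_assert_not_reached();       /* not covered by this sketch */
    }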



* Re: [PATCH v4 06/57] accel/tcg: Honor atomicity of loads
  2023-05-09 14:27         ` Richard Henderson
@ 2023-05-09 14:33           ` Peter Maydell
  0 siblings, 0 replies; 132+ messages in thread
From: Peter Maydell @ 2023-05-09 14:33 UTC (permalink / raw)
  To: Richard Henderson
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv,
	qemu-s390x, Alex Bennée

On Tue, 9 May 2023 at 15:27, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> On 5/9/23 13:04, Peter Maydell wrote:
> >> If the LDP is aligned mod 8, but not aligned mod 16, then both 8-byte operations must be
> >> (separately) atomic, and we return MO_64.
> >
> > So there's an implicit "at most 2 atomic sub-operations
> > inside a WITHIN16 load" restriction? i.e. you can't
> > use WITHIN16 to say "do this 8 byte load atomically but
> > if it's not in a 16-byte region do it with 4 2-byte loads",
> > even though in theory MO_ATOM_WITHIN16 | MO_ATMAX_2 | MO_8
> > would describe that ?
>
> Correct on both counts.  While you're right that this is a valid generalization, it's not
> something for which I've found a use case.

Yeah, that's fine -- but we should note the restrictions/
requirements in the doc comment for WITHIN16. (And maybe
an assert somewhere if there's somewhere convenient to
put it?)

thanks
-- PMM
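
A possible shape for that note and assert, purely as a sketch (the wording
and the placement are invented; with sizes in log2 form, "at most two
sub-operations" amounts to size <= atmax + 1):

    /*
     * MO_ATOM_WITHIN16: the access is fully atomic if it does not
     * cross a 16-byte boundary; otherwise it is treated as at most
     * two MO_ATMAX_*-sized pieces, each individually atomic when it
     * lies within a 16-byte block.
     */

    tcg_debug_assert(size <= atmax + 1);   /* at most two pieces */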



* Re: [PATCH v4 30/57] tcg/sparc64: Allocate %g2 as a third temporary
  2023-05-09  9:24       ` Peter Maydell
@ 2023-05-09 14:34         ` Richard Henderson
  0 siblings, 0 replies; 132+ messages in thread
From: Richard Henderson @ 2023-05-09 14:34 UTC (permalink / raw)
  To: Peter Maydell
  Cc: qemu-devel, git, gaosong, philmd, qemu-arm, qemu-riscv, qemu-s390x

On 5/9/23 10:24, Peter Maydell wrote:
> On Mon, 8 May 2023 at 16:17, Richard Henderson
> <richard.henderson@linaro.org> wrote:
>>
>> On 5/5/23 13:19, Peter Maydell wrote:
>>> On Wed, 3 May 2023 at 08:17, Richard Henderson
>>> <richard.henderson@linaro.org> wrote:
>>>>
>>>> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
>>>> ---
>>>>    tcg/sparc64/tcg-target.c.inc | 15 +++++++--------
>>>>    1 file changed, 7 insertions(+), 8 deletions(-)
>>>>
>>>> diff --git a/tcg/sparc64/tcg-target.c.inc b/tcg/sparc64/tcg-target.c.inc
>>>> index e997db2645..64464ab363 100644
>>>> --- a/tcg/sparc64/tcg-target.c.inc
>>>> +++ b/tcg/sparc64/tcg-target.c.inc
>>>> @@ -83,9 +83,10 @@ static const char * const tcg_target_reg_names[TCG_TARGET_NB_REGS] = {
>>>>    #define ALL_GENERAL_REGS     MAKE_64BIT_MASK(0, 32)
>>>>    #define ALL_QLDST_REGS       (ALL_GENERAL_REGS & ~SOFTMMU_RESERVE_REGS)
>>>>
>>>> -/* Define some temporary registers.  T2 is used for constant generation.  */
>>>> +/* Define some temporary registers.  T3 is used for constant generation.  */
>>>>    #define TCG_REG_T1  TCG_REG_G1
>>>> -#define TCG_REG_T2  TCG_REG_O7
>>>> +#define TCG_REG_T2  TCG_REG_G2
>>>> +#define TCG_REG_T3  TCG_REG_O7
>>>>
>>>>    #ifndef CONFIG_SOFTMMU
>>>>    # define TCG_GUEST_BASE_REG TCG_REG_I5
>>>> @@ -110,7 +111,6 @@ static const int tcg_target_reg_alloc_order[] = {
>>>>        TCG_REG_I4,
>>>>        TCG_REG_I5,
>>>>
>>>> -    TCG_REG_G2,
>>>>        TCG_REG_G3,
>>>>        TCG_REG_G4,
>>>>        TCG_REG_G5,
>>>> @@ -492,8 +492,8 @@ static void tcg_out_movi_int(TCGContext *s, TCGType type, TCGReg ret,
>>>>    static void tcg_out_movi(TCGContext *s, TCGType type,
>>>>                             TCGReg ret, tcg_target_long arg)
>>>>    {
>>>> -    tcg_debug_assert(ret != TCG_REG_T2);
>>>> -    tcg_out_movi_int(s, type, ret, arg, false, TCG_REG_T2);
>>>> +    tcg_debug_assert(ret != TCG_REG_T3);
>>>> +    tcg_out_movi_int(s, type, ret, arg, false, TCG_REG_T3);
>>>>    }
>>>
>>> Why do we need to change this usage of TCG_REG_T2 but not
>>> any of the others ?
>>
>> To match the comment above.
> 
> To expand, what I mean is "when I'm reviewing this patch, what
> do I need to know in order to know whether any particular
> instance of TCG_REG_T2 should be changed to _T3 or not?".
> All the sites where we *don't* change T2 to T3 are now
> using a different register, so there is presumably some
> logic for how we tell whether that's safe or not. The
> "no behaviour change" option would be to change all of them.

Oh.  Well, we could change none of them, including the comment, and also be correct. 
There is no conflict anywhere.

Only with patch 34 ("Use standard slow path for softmmu") do we first see all three temps 
in use at the same time.  Moreover, when I wrote this patch I thought there would in fact 
be a conflict with the use of tcg_out_movi within the slow path patch.  Then I found I 
needed to split out tcg_out_movi_s32 (patch 33) so that I could avoid the assert altogether.

Clearer?  Or not?


r~




Thread overview: 132+ messages
2023-05-03  7:05 [PATCH v4 00/57] tcg: Improve atomicity support Richard Henderson
2023-05-03  7:06 ` [PATCH v4 01/57] include/exec/memop: Add bits describing atomicity Richard Henderson
2023-05-04 14:49   ` Peter Maydell
2023-05-03  7:06 ` [PATCH v4 02/57] accel/tcg: Add cpu_in_serial_context Richard Henderson
2023-05-03  7:06 ` [PATCH v4 03/57] accel/tcg: Introduce tlb_read_idx Richard Henderson
2023-05-04 15:02   ` Peter Maydell
2023-05-05 18:57     ` Richard Henderson
2023-05-07 10:09       ` Peter Maydell
2023-05-08 10:02         ` Richard Henderson
2023-05-03  7:06 ` [PATCH v4 04/57] accel/tcg: Reorg system mode load helpers Richard Henderson
2023-05-04 15:39   ` Peter Maydell
2023-05-03  7:06 ` [PATCH v4 05/57] accel/tcg: Reorg system mode store helpers Richard Henderson
2023-05-04 15:44   ` Peter Maydell
2023-05-03  7:06 ` [PATCH v4 06/57] accel/tcg: Honor atomicity of loads Richard Henderson
2023-05-04 17:17   ` Peter Maydell
2023-05-05 20:19     ` Richard Henderson
2023-05-09 12:04       ` Peter Maydell
2023-05-09 14:27         ` Richard Henderson
2023-05-09 14:33           ` Peter Maydell
2023-05-03  7:06 ` [PATCH v4 07/57] accel/tcg: Honor atomicity of stores Richard Henderson
2023-05-05  9:28   ` Peter Maydell
2023-05-08 10:11     ` Richard Henderson
2023-05-03  7:06 ` [PATCH v4 08/57] target/loongarch: Do not include tcg-ldst.h Richard Henderson
2023-05-05  9:29   ` Peter Maydell
2023-05-03  7:06 ` [PATCH v4 09/57] tcg: Unify helper_{be,le}_{ld,st}* Richard Henderson
2023-05-05  9:36   ` Peter Maydell
2023-05-03  7:06 ` [PATCH v4 10/57] accel/tcg: Implement helper_{ld, st}*_mmu for user-only Richard Henderson
2023-05-05  9:43   ` Peter Maydell
2023-05-03  7:06 ` [PATCH v4 11/57] tcg/tci: Use helper_{ld,st}*_mmu " Richard Henderson
2023-05-05  9:44   ` Peter Maydell
2023-05-03  7:06 ` [PATCH v4 12/57] tcg: Add 128-bit guest memory primitives Richard Henderson
2023-05-05 10:04   ` Peter Maydell
2023-05-03  7:06 ` [PATCH v4 13/57] meson: Detect atomic128 support with optimization Richard Henderson
2023-05-05 10:29   ` Peter Maydell
2023-05-03  7:06 ` [PATCH v4 14/57] tcg/i386: Add have_atomic16 Richard Henderson
2023-05-05 10:34   ` Peter Maydell
2023-05-08 13:41     ` Richard Henderson
2023-05-03  7:06 ` [PATCH v4 15/57] accel/tcg: Use have_atomic16 in ldst_atomicity.c.inc Richard Henderson
2023-05-05 10:37   ` Peter Maydell
2023-05-08 13:48     ` Richard Henderson
2023-05-03  7:06 ` [PATCH v4 16/57] accel/tcg: Add aarch64 specific support in ldst_atomicity Richard Henderson
2023-05-03  7:06 ` [PATCH v4 17/57] tcg/aarch64: Detect have_lse, have_lse2 for linux Richard Henderson
2023-05-05 10:41   ` Peter Maydell
2023-05-03  7:06 ` [PATCH v4 18/57] tcg/aarch64: Detect have_lse, have_lse2 for darwin Richard Henderson
2023-05-05 10:43   ` Peter Maydell
2023-05-03  7:06 ` [PATCH v4 19/57] accel/tcg: Add have_lse2 support in ldst_atomicity Richard Henderson
2023-05-03  7:06 ` [PATCH v4 20/57] tcg: Introduce TCG_OPF_TYPE_MASK Richard Henderson
2023-05-05 10:45   ` Peter Maydell
2023-05-03  7:06 ` [PATCH v4 21/57] tcg/i386: Use full load/store helpers in user-only mode Richard Henderson
2023-05-05 12:01   ` Peter Maydell
2023-05-03  7:06 ` [PATCH v4 22/57] tcg/aarch64: " Richard Henderson
2023-05-05 12:06   ` Peter Maydell
2023-05-03  7:06 ` [PATCH v4 23/57] tcg/ppc: " Richard Henderson
2023-05-05 12:07   ` Peter Maydell
2023-05-03  7:06 ` [PATCH v4 24/57] tcg/loongarch64: " Richard Henderson
2023-05-05 12:07   ` Peter Maydell
2023-05-03  7:06 ` [PATCH v4 25/57] tcg/riscv: " Richard Henderson
2023-05-05 12:07   ` Peter Maydell
2023-05-03  7:06 ` [PATCH v4 26/57] tcg/arm: Adjust constraints on qemu_ld/st Richard Henderson
2023-05-05 12:14   ` Peter Maydell
2023-05-08 15:13     ` Richard Henderson
2023-05-03  7:06 ` [PATCH v4 27/57] tcg/arm: Use full load/store helpers in user-only mode Richard Henderson
2023-05-05 12:15   ` Peter Maydell
2023-05-03  7:06 ` [PATCH v4 28/57] tcg/mips: " Richard Henderson
2023-05-05 12:15   ` Peter Maydell
2023-05-03  7:06 ` [PATCH v4 29/57] tcg/s390x: " Richard Henderson
2023-05-05 12:16   ` Peter Maydell
2023-05-03  7:06 ` [PATCH v4 30/57] tcg/sparc64: Allocate %g2 as a third temporary Richard Henderson
2023-05-05 12:19   ` Peter Maydell
2023-05-08 15:17     ` Richard Henderson
2023-05-09  9:24       ` Peter Maydell
2023-05-09 14:34         ` Richard Henderson
2023-05-03  7:06 ` [PATCH v4 31/57] tcg/sparc64: Rename tcg_out_movi_imm13 to tcg_out_movi_s13 Richard Henderson
2023-05-05 12:20   ` Peter Maydell
2023-05-03  7:06 ` [PATCH v4 32/57] tcg/sparc64: Rename tcg_out_movi_imm32 to tcg_out_movi_u32 Richard Henderson
2023-05-05 12:22   ` Peter Maydell
2023-05-03  7:06 ` [PATCH v4 33/57] tcg/sparc64: Split out tcg_out_movi_s32 Richard Henderson
2023-05-05 12:23   ` Peter Maydell
2023-05-03  7:06 ` [PATCH v4 34/57] tcg/sparc64: Use standard slow path for softmmu Richard Henderson
2023-05-05 12:26   ` Peter Maydell
2023-05-03  7:06 ` [PATCH v4 35/57] accel/tcg: Remove helper_unaligned_{ld,st} Richard Henderson
2023-05-05 12:27   ` Peter Maydell
2023-05-03  7:06 ` [PATCH v4 36/57] tcg/loongarch64: Assert the host supports unaligned accesses Richard Henderson
2023-05-05 12:30   ` Peter Maydell
2023-05-05 13:24   ` WANG Xuerui
2023-05-06  2:03     ` Song Gao
2023-05-03  7:06 ` [PATCH v4 37/57] tcg/loongarch64: Support softmmu " Richard Henderson
2023-05-05 12:35   ` Peter Maydell
2023-05-03  7:06 ` [PATCH v4 38/57] tcg/riscv: " Richard Henderson
2023-05-05 10:35   ` LIU Zhiwei
2023-05-03  7:06 ` [PATCH v4 39/57] tcg: Introduce tcg_target_has_memory_bswap Richard Henderson
2023-05-05 12:41   ` Peter Maydell
2023-05-03  7:06 ` [PATCH v4 40/57] tcg: Add INDEX_op_qemu_{ld,st}_i128 Richard Henderson
2023-05-05 12:45   ` Peter Maydell
2023-05-03  7:06 ` [PATCH v4 41/57] tcg: Support TCG_TYPE_I128 in tcg_out_{ld, st}_helper_{args, ret} Richard Henderson
2023-05-05 12:53   ` Peter Maydell
2023-05-03  7:06 ` [PATCH v4 42/57] tcg: Introduce atom_and_align_for_opc Richard Henderson
2023-05-05 13:03   ` Peter Maydell
2023-05-03  7:06 ` [PATCH v4 43/57] tcg/i386: Use atom_and_align_for_opc Richard Henderson
2023-05-05 13:14   ` Peter Maydell
2023-05-03  7:06 ` [PATCH v4 44/57] tcg/aarch64: " Richard Henderson
2023-05-05 13:15   ` Peter Maydell
2023-05-03  7:06 ` [PATCH v4 45/57] tcg/arm: " Richard Henderson
2023-05-05 13:15   ` Peter Maydell
2023-05-03  7:06 ` [PATCH v4 46/57] tcg/loongarch64: " Richard Henderson
2023-05-05 13:16   ` Peter Maydell
2023-05-03  7:06 ` [PATCH v4 47/57] tcg/mips: " Richard Henderson
2023-05-05 13:17   ` Peter Maydell
2023-05-03  7:06 ` [PATCH v4 48/57] tcg/ppc: " Richard Henderson
2023-05-05 13:18   ` Peter Maydell
2023-05-08 17:32     ` Richard Henderson
2023-05-03  7:06 ` [PATCH v4 49/57] tcg/riscv: " Richard Henderson
2023-05-05 13:19   ` Peter Maydell
2023-05-08 17:33     ` Richard Henderson
2023-05-03  7:06 ` [PATCH v4 50/57] tcg/s390x: " Richard Henderson
2023-05-05 13:20   ` Peter Maydell
2023-05-03  7:06 ` [PATCH v4 51/57] tcg/sparc64: " Richard Henderson
2023-05-05 13:20   ` Peter Maydell
2023-05-08 17:34     ` Richard Henderson
2023-05-03  7:06 ` [PATCH v4 52/57] tcg/i386: Honor 64-bit atomicity in 32-bit mode Richard Henderson
2023-05-05 13:27   ` Peter Maydell
2023-05-08 16:15     ` Richard Henderson
2023-05-03  7:06 ` [PATCH v4 53/57] tcg/i386: Support 128-bit load/store with have_atomic16 Richard Henderson
2023-05-05 13:34   ` Peter Maydell
2023-05-03  7:06 ` [PATCH v4 54/57] tcg/aarch64: Rename temporaries Richard Henderson
2023-05-05 13:36   ` Peter Maydell
2023-05-03  7:06 ` [PATCH v4 55/57] tcg/aarch64: Support 128-bit load/store Richard Henderson
2023-05-05 13:41   ` Peter Maydell
2023-05-03  7:06 ` [PATCH v4 56/57] tcg/ppc: " Richard Henderson
2023-05-08 12:16   ` Daniel Henrique Barboza
2023-05-03  7:06 ` [PATCH v4 57/57] tcg/s390x: " Richard Henderson
2023-05-05 13:43 ` [PATCH v4 00/57] tcg: Improve atomicity support Peter Maydell
