Re: [PATCH v6 70/82] target/arm: Implement SVE2 LD1RO

From: Peter Maydell <peter.maydell@linaro.org>
To: Richard Henderson <richard.henderson@linaro.org>
Cc: qemu-arm <qemu-arm@nongnu.org>, QEMU Developers <qemu-devel@nongnu.org>
Subject: Re: [PATCH v6 70/82] target/arm: Implement SVE2 LD1RO
Date: Thu, 13 May 2021 17:41:47 +0100
Message-ID: <CAFEAcA8sf3M_2pVZPn2AwLO0vdc8PoOwWTtxeYYHTxpUphhkuA@mail.gmail.com>
In-Reply-To: <20210430202610.1136687-71-richard.henderson@linaro.org>

On Fri, 30 Apr 2021 at 22:31, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
> ---
>  target/arm/sve.decode      |  4 ++
>  target/arm/translate-sve.c | 97 ++++++++++++++++++++++++++++++++++++++
>  2 files changed, 101 insertions(+)
>
> diff --git a/target/arm/sve.decode b/target/arm/sve.decode
> index 17adb393ff..df870ce23b 100644
> --- a/target/arm/sve.decode
> +++ b/target/arm/sve.decode
> @@ -1077,11 +1077,15 @@ LD_zpri         1010010 .. nreg:2 0.... 111 ... ..... .....     @rpri_load_msz
>  # SVE load and broadcast quadword (scalar plus scalar)
>  LD1RQ_zprr      1010010 .. 00 ..... 000 ... ..... ..... \
>                  @rprr_load_msz nreg=0
> +LD1RO_zprr      1010010 .. 01 ..... 000 ... ..... ..... \
> +                @rprr_load_msz nreg=0
>
>  # SVE load and broadcast quadword (scalar plus immediate)
>  # LD1RQB, LD1RQH, LD1RQS, LD1RQD
>  LD1RQ_zpri      1010010 .. 00 0.... 001 ... ..... ..... \
>                  @rpri_load_msz nreg=0
> +LD1RO_zpri      1010010 .. 01 0.... 001 ... ..... ..... \
> +                @rpri_load_msz nreg=0
>
>  # SVE 32-bit gather prefetch (scalar plus 32-bit scaled offsets)
>  PRF             1000010 00 -1 ----- 0-- --- ----- 0 ----
> diff --git a/target/arm/translate-sve.c b/target/arm/translate-sve.c
> index ca393164bc..8a4eb8542f 100644
> --- a/target/arm/translate-sve.c
> +++ b/target/arm/translate-sve.c
> @@ -5586,6 +5586,103 @@ static bool trans_LD1RQ_zpri(DisasContext *s, arg_rpri_load *a)
>      return true;
>  }
>
> +static void do_ldro(DisasContext *s, int zt, int pg, TCGv_i64 addr, int dtype)
> +{
> +    unsigned vsz = vec_full_reg_size(s);
> +    unsigned vsz_r32;
> +    TCGv_ptr t_pg;
> +    TCGv_i32 t_desc;
> +    int desc, poff, doff;
> +
> +    if (vsz < 32) {
> +        /*
> +         * Note that this UNDEFINED check comes after CheckSVEEnabled()
> +         * in the ARM pseudocode, which is the sve_access_check() done
> +         * in our caller.  We should not now return false from the caller.
> +         */
> +        unallocated_encoding(s);
> +        return;
> +    }
> +
> +    /* Load the first octaword using the normal predicated load helpers.  */
> +
> +    poff = pred_full_reg_offset(s, pg);
> +    if (vsz > 32) {
> +        /*
> +         * Zero-extend the first 32 bits of the predicate into a temporary.
> +         * This avoids triggering an assert making sure we don't have bits
> +         * set within a predicate beyond VQ, but we have lowered VQ to 2
> +         * for this load operation.
> +         */
> +        TCGv_i64 tmp = tcg_temp_new_i64();
> +#ifdef HOST_WORDS_BIGENDIAN
> +        poff += 4;
> +#endif
> +        tcg_gen_ld32u_i64(tmp, cpu_env, poff);
> +
> +        poff = offsetof(CPUARMState, vfp.preg_tmp);
> +        tcg_gen_st_i64(tmp, cpu_env, poff);
> +        tcg_temp_free_i64(tmp);
> +    }
> +
> +    t_pg = tcg_temp_new_ptr();
> +    tcg_gen_addi_ptr(t_pg, cpu_env, poff);
> +
> +    desc = simd_desc(32, 32, zt);
> +    t_desc = tcg_const_i32(desc);

Why are these two lines (the simd_desc() and tcg_const_i32() setup)
down here? In do_ldrq() they are higher up. Unless there's a reason
for the two functions to differ, I think we should keep them the same.

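As an aside on the predicate handling in the hunk above: SVE predicates
carry one bit per vector byte, so a 32-byte load needs only the low 32
predicate bits. The sketch below is a standalone illustration of why the
offset is bumped by 4 on big-endian hosts. It assumes the predicate is
stored as host-order 64-bit words; host_is_big_endian() and
low32_of_pred_word() are hypothetical names, not QEMU functions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: detect host byte order at run time. */
static int host_is_big_endian(void)
{
    const uint16_t probe = 1;
    uint8_t first;

    memcpy(&first, &probe, 1);
    return first == 0;
}

/*
 * Hypothetical model of a 32-bit zero-extending load from the first
 * 64-bit predicate word.  The low half of the word sits at byte
 * offset 0 on little-endian hosts and at byte offset 4 on big-endian
 * hosts.
 */
static uint64_t low32_of_pred_word(const uint64_t *word)
{
    size_t off = host_is_big_endian() ? 4 : 0;
    uint32_t lo;

    memcpy(&lo, (const uint8_t *)word + off, sizeof(lo));
    return lo; /* implicit zero-extension to 64 bits */
}
```

Whatever the host byte order, the result is the low 32 bits of the
word with the upper 32 bits cleared.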
> +
> +    gen_helper_gvec_mem *fn
> +        = ldr_fns[s->mte_active[0]][s->be_data == MO_BE][dtype][0];
> +    fn(cpu_env, t_pg, addr, t_desc);
> +
> +    tcg_temp_free_ptr(t_pg);
> +    tcg_temp_free_i32(t_desc);
> +
> +    /*
> +     * Replicate that first octaword.
> +     * The replication happens in units of 32; if the full vector size
> +     * is not a multiple of 32, the final bits are zeroed.
> +     */
> +    doff = vec_full_reg_offset(s, zt);

Similarly, in do_ldrq() this variable is named "dofs"; using the same
name here would keep the two functions consistent.

> +    vsz_r32 = QEMU_ALIGN_DOWN(vsz, 32);
> +    if (vsz >= 64) {
> +        tcg_gen_gvec_dup_mem(5, doff + 32, doff, vsz_r32 - 32, vsz - 32);
> +    } else if (vsz > vsz_r32) {
> +        /* Nop move, with side effect of clearing the tail. */
> +        tcg_gen_gvec_mov(MO_64, doff, doff, vsz_r32, vsz);
> +    }
> +}
> +
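For anyone checking the replication step, here is my reading of the
intended result as a plain C model. This is only a sketch under stated
assumptions: the vector register is modelled as a byte array,
replicate_octaword() is a hypothetical name, and it does not use the
TCG gvec API at all.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical model of the tail end of do_ldro(): the first 32 bytes
 * of zreg have already been loaded; copy them into every further
 * complete 32-byte unit and zero whatever is left over.
 */
static void replicate_octaword(uint8_t *zreg, size_t vsz)
{
    size_t vsz_r32 = vsz & ~(size_t)31; /* vsz rounded down to 32 */
    size_t off;

    /* Smaller vectors are rejected as UNDEFINED before this point. */
    assert(vsz >= 32);

    for (off = 32; off < vsz_r32; off += 32) {
        memcpy(zreg + off, zreg, 32);
    }
    if (vsz > vsz_r32) {
        /* Vector size is not a multiple of 32: clear the tail. */
        memset(zreg + vsz_r32, 0, vsz - vsz_r32);
    }
}
```

With vsz == 48 nothing is copied and bytes 32..47 are cleared, which
corresponds to the "nop move" branch in the patch; with vsz == 80 the
octaword is copied once to offset 32 and bytes 64..79 are cleared.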

Otherwise
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>

thanks
-- PMM