* [PATCH v2 00/12] target/arm: Implement BFloat16
@ 2021-05-25 22:58 Richard Henderson
2021-05-25 22:58 ` [PATCH v2 01/12] target/arm: Add isar_feature_{aa32, aa64, aa64_sve}_bf16 Richard Henderson
` (12 more replies)
0 siblings, 13 replies; 14+ messages in thread
From: Richard Henderson @ 2021-05-25 22:58 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Whee! All prerequisites are now merged.
Patches missing r-b:
05-softfpu-Add-float_round_to_odd_inf.patch
06-target-arm-Implement-bfloat16-dot-product-vector.patch
07-target-arm-Implement-bfloat16-dot-product-indexed.patch
11-linux-user-aarch64-Enable-hwcap-bits-for-bfloat16.patch
12-target-arm-Enable-BFloat16-extensions.patch
Per the question of whether additional checks vs VFP or NEON
are required, I have disabled BF16 when either VFP or NEON are
also disabled. Which seems like a similarly reasonable choice.
r~
Richard Henderson (12):
target/arm: Add isar_feature_{aa32,aa64,aa64_sve}_bf16
target/arm: Unify unallocated path in disas_fp_1src
target/arm: Implement scalar float32 to bfloat16 conversion
target/arm: Implement vector float32 to bfloat16 conversion
softfpu: Add float_round_to_odd_inf
target/arm: Implement bfloat16 dot product (vector)
target/arm: Implement bfloat16 dot product (indexed)
target/arm: Implement bfloat16 matrix multiply accumulate
target/arm: Implement bfloat widening fma (vector)
target/arm: Implement bfloat widening fma (indexed)
linux-user/aarch64: Enable hwcap bits for bfloat16
target/arm: Enable BFloat16 extensions
include/fpu/softfloat-types.h | 4 +-
target/arm/cpu.h | 15 ++++
target/arm/helper-sve.h | 4 +
target/arm/helper.h | 15 ++++
target/arm/neon-dp.decode | 1 +
target/arm/neon-shared.decode | 11 +++
target/arm/sve.decode | 19 ++++-
target/arm/vfp.decode | 2 +
linux-user/elfload.c | 2 +
target/arm/cpu.c | 3 +
target/arm/cpu64.c | 3 +
target/arm/cpu_tcg.c | 1 +
target/arm/sve_helper.c | 2 +
target/arm/translate-a64.c | 142 +++++++++++++++++++++++++++++-----
target/arm/translate-neon.c | 91 ++++++++++++++++++++++
target/arm/translate-sve.c | 112 +++++++++++++++++++++++++++
target/arm/translate-vfp.c | 24 ++++++
target/arm/vec_helper.c | 140 ++++++++++++++++++++++++++++++++-
target/arm/vfp_helper.c | 12 +++
fpu/softfloat-parts.c.inc | 6 +-
20 files changed, 584 insertions(+), 25 deletions(-)
--
2.25.1
^ permalink raw reply [flat|nested] 14+ messages in thread
* [PATCH v2 01/12] target/arm: Add isar_feature_{aa32, aa64, aa64_sve}_bf16
2021-05-25 22:58 [PATCH v2 00/12] target/arm: Implement BFloat16 Richard Henderson
@ 2021-05-25 22:58 ` Richard Henderson
2021-05-25 22:58 ` [PATCH v2 02/12] target/arm: Unify unallocated path in disas_fp_1src Richard Henderson
` (11 subsequent siblings)
12 siblings, 0 replies; 14+ messages in thread
From: Richard Henderson @ 2021-05-25 22:58 UTC (permalink / raw)
To: qemu-devel; +Cc: Peter Maydell, qemu-arm
Note that the SVE BFLOAT16 support does not require SVE2,
it is an independent extension.
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/cpu.h | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/target/arm/cpu.h b/target/arm/cpu.h
index 04f8be35bf..d68275b15e 100644
--- a/target/arm/cpu.h
+++ b/target/arm/cpu.h
@@ -3783,6 +3783,11 @@ static inline bool isar_feature_aa32_predinv(const ARMISARegisters *id)
return FIELD_EX32(id->id_isar6, ID_ISAR6, SPECRES) != 0;
}
+static inline bool isar_feature_aa32_bf16(const ARMISARegisters *id)
+{
+ return FIELD_EX32(id->id_isar6, ID_ISAR6, BF16) != 0;
+}
+
static inline bool isar_feature_aa32_i8mm(const ARMISARegisters *id)
{
return FIELD_EX32(id->id_isar6, ID_ISAR6, I8MM) != 0;
@@ -4122,6 +4127,11 @@ static inline bool isar_feature_aa64_dcpodp(const ARMISARegisters *id)
return FIELD_EX64(id->id_aa64isar1, ID_AA64ISAR1, DPB) >= 2;
}
+static inline bool isar_feature_aa64_bf16(const ARMISARegisters *id)
+{
+ return FIELD_EX64(id->id_aa64isar1, ID_AA64ISAR1, BF16) != 0;
+}
+
static inline bool isar_feature_aa64_fp_simd(const ARMISARegisters *id)
{
/* We always set the AdvSIMD and FP fields identically. */
@@ -4266,6 +4276,11 @@ static inline bool isar_feature_aa64_sve2_bitperm(const ARMISARegisters *id)
return FIELD_EX64(id->id_aa64zfr0, ID_AA64ZFR0, BITPERM) != 0;
}
+static inline bool isar_feature_aa64_sve_bf16(const ARMISARegisters *id)
+{
+ return FIELD_EX64(id->id_aa64zfr0, ID_AA64ZFR0, BFLOAT16) != 0;
+}
+
static inline bool isar_feature_aa64_sve2_sha3(const ARMISARegisters *id)
{
return FIELD_EX64(id->id_aa64zfr0, ID_AA64ZFR0, SHA3) != 0;
--
2.25.1
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH v2 02/12] target/arm: Unify unallocated path in disas_fp_1src
2021-05-25 22:58 [PATCH v2 00/12] target/arm: Implement BFloat16 Richard Henderson
2021-05-25 22:58 ` [PATCH v2 01/12] target/arm: Add isar_feature_{aa32, aa64, aa64_sve}_bf16 Richard Henderson
@ 2021-05-25 22:58 ` Richard Henderson
2021-05-25 22:58 ` [PATCH v2 03/12] target/arm: Implement scalar float32 to bfloat16 conversion Richard Henderson
` (10 subsequent siblings)
12 siblings, 0 replies; 14+ messages in thread
From: Richard Henderson @ 2021-05-25 22:58 UTC (permalink / raw)
To: qemu-devel; +Cc: Peter Maydell, qemu-arm
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/translate-a64.c | 15 ++++++---------
1 file changed, 6 insertions(+), 9 deletions(-)
diff --git a/target/arm/translate-a64.c b/target/arm/translate-a64.c
index ceac0ee2bd..510cb6ca5e 100644
--- a/target/arm/translate-a64.c
+++ b/target/arm/translate-a64.c
@@ -6494,8 +6494,7 @@ static void disas_fp_1src(DisasContext *s, uint32_t insn)
int rd = extract32(insn, 0, 5);
if (mos) {
- unallocated_encoding(s);
- return;
+ goto do_unallocated;
}
switch (opcode) {
@@ -6504,8 +6503,7 @@ static void disas_fp_1src(DisasContext *s, uint32_t insn)
/* FCVT between half, single and double precision */
int dtype = extract32(opcode, 0, 2);
if (type == 2 || dtype == type) {
- unallocated_encoding(s);
- return;
+ goto do_unallocated;
}
if (!fp_access_check(s)) {
return;
@@ -6517,8 +6515,7 @@ static void disas_fp_1src(DisasContext *s, uint32_t insn)
case 0x10 ... 0x13: /* FRINT{32,64}{X,Z} */
if (type > 1 || !dc_isar_feature(aa64_frint, s)) {
- unallocated_encoding(s);
- return;
+ goto do_unallocated;
}
/* fall through */
case 0x0 ... 0x3:
@@ -6540,8 +6537,7 @@ static void disas_fp_1src(DisasContext *s, uint32_t insn)
break;
case 3:
if (!dc_isar_feature(aa64_fp16, s)) {
- unallocated_encoding(s);
- return;
+ goto do_unallocated;
}
if (!fp_access_check(s)) {
@@ -6550,11 +6546,12 @@ static void disas_fp_1src(DisasContext *s, uint32_t insn)
handle_fp_1src_half(s, opcode, rd, rn);
break;
default:
- unallocated_encoding(s);
+ goto do_unallocated;
}
break;
default:
+ do_unallocated:
unallocated_encoding(s);
break;
}
--
2.25.1
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH v2 03/12] target/arm: Implement scalar float32 to bfloat16 conversion
2021-05-25 22:58 [PATCH v2 00/12] target/arm: Implement BFloat16 Richard Henderson
2021-05-25 22:58 ` [PATCH v2 01/12] target/arm: Add isar_feature_{aa32, aa64, aa64_sve}_bf16 Richard Henderson
2021-05-25 22:58 ` [PATCH v2 02/12] target/arm: Unify unallocated path in disas_fp_1src Richard Henderson
@ 2021-05-25 22:58 ` Richard Henderson
2021-05-25 22:58 ` [PATCH v2 04/12] target/arm: Implement vector " Richard Henderson
` (9 subsequent siblings)
12 siblings, 0 replies; 14+ messages in thread
From: Richard Henderson @ 2021-05-25 22:58 UTC (permalink / raw)
To: qemu-devel; +Cc: Peter Maydell, qemu-arm
This is the 64-bit BFCVT and the 32-bit VCVT{B,T}.BF16.F32.
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/helper.h | 1 +
target/arm/vfp.decode | 2 ++
target/arm/translate-a64.c | 19 +++++++++++++++++++
target/arm/translate-vfp.c | 24 ++++++++++++++++++++++++
target/arm/vfp_helper.c | 5 +++++
5 files changed, 51 insertions(+)
diff --git a/target/arm/helper.h b/target/arm/helper.h
index 23ccb0f72f..9977a827e9 100644
--- a/target/arm/helper.h
+++ b/target/arm/helper.h
@@ -143,6 +143,7 @@ DEF_HELPER_3(vfp_cmped, void, f64, f64, env)
DEF_HELPER_2(vfp_fcvtds, f64, f32, env)
DEF_HELPER_2(vfp_fcvtsd, f32, f64, env)
+DEF_HELPER_FLAGS_2(bfcvt, TCG_CALL_NO_RWG, i32, f32, ptr)
DEF_HELPER_2(vfp_uitoh, f16, i32, ptr)
DEF_HELPER_2(vfp_uitos, f32, i32, ptr)
diff --git a/target/arm/vfp.decode b/target/arm/vfp.decode
index 6f7f28f9a4..52535d9b0b 100644
--- a/target/arm/vfp.decode
+++ b/target/arm/vfp.decode
@@ -205,6 +205,8 @@ VCVT_f64_f16 ---- 1110 1.11 0010 .... 1011 t:1 1.0 .... \
# VCVTB and VCVTT to f16: Vd format is always vd_sp;
# Vm format depends on size bit
+VCVT_b16_f32 ---- 1110 1.11 0011 .... 1001 t:1 1.0 .... \
+ vd=%vd_sp vm=%vm_sp
VCVT_f16_f32 ---- 1110 1.11 0011 .... 1010 t:1 1.0 .... \
vd=%vd_sp vm=%vm_sp
VCVT_f16_f64 ---- 1110 1.11 0011 .... 1011 t:1 1.0 .... \
diff --git a/target/arm/translate-a64.c b/target/arm/translate-a64.c
index 510cb6ca5e..90605d7dce 100644
--- a/target/arm/translate-a64.c
+++ b/target/arm/translate-a64.c
@@ -6273,6 +6273,9 @@ static void handle_fp_1src_single(DisasContext *s, int opcode, int rd, int rn)
case 0x3: /* FSQRT */
gen_helper_vfp_sqrts(tcg_res, tcg_op, cpu_env);
goto done;
+ case 0x6: /* BFCVT */
+ gen_fpst = gen_helper_bfcvt;
+ break;
case 0x8: /* FRINTN */
case 0x9: /* FRINTP */
case 0xa: /* FRINTM */
@@ -6550,6 +6553,22 @@ static void disas_fp_1src(DisasContext *s, uint32_t insn)
}
break;
+ case 0x6:
+ switch (type) {
+ case 1: /* BFCVT */
+ if (!dc_isar_feature(aa64_bf16, s)) {
+ goto do_unallocated;
+ }
+ if (!fp_access_check(s)) {
+ return;
+ }
+ handle_fp_1src_single(s, opcode, rd, rn);
+ break;
+ default:
+ goto do_unallocated;
+ }
+ break;
+
default:
do_unallocated:
unallocated_encoding(s);
diff --git a/target/arm/translate-vfp.c b/target/arm/translate-vfp.c
index 3da84f30a0..d8271dbaac 100644
--- a/target/arm/translate-vfp.c
+++ b/target/arm/translate-vfp.c
@@ -3025,6 +3025,30 @@ static bool trans_VCVT_f64_f16(DisasContext *s, arg_VCVT_f64_f16 *a)
return true;
}
+static bool trans_VCVT_b16_f32(DisasContext *s, arg_VCVT_b16_f32 *a)
+{
+ TCGv_ptr fpst;
+ TCGv_i32 tmp;
+
+ if (!dc_isar_feature(aa32_bf16, s)) {
+ return false;
+ }
+
+ if (!vfp_access_check(s)) {
+ return true;
+ }
+
+ fpst = fpstatus_ptr(FPST_FPCR);
+ tmp = tcg_temp_new_i32();
+
+ vfp_load_reg32(tmp, a->vm);
+ gen_helper_bfcvt(tmp, tmp, fpst);
+ tcg_gen_st16_i32(tmp, cpu_env, vfp_f16_offset(a->vd, a->t));
+ tcg_temp_free_ptr(fpst);
+ tcg_temp_free_i32(tmp);
+ return true;
+}
+
static bool trans_VCVT_f16_f32(DisasContext *s, arg_VCVT_f16_f32 *a)
{
TCGv_ptr fpst;
diff --git a/target/arm/vfp_helper.c b/target/arm/vfp_helper.c
index 01b9d8557f..fe7a2a5daa 100644
--- a/target/arm/vfp_helper.c
+++ b/target/arm/vfp_helper.c
@@ -408,6 +408,11 @@ float32 VFP_HELPER(fcvts, d)(float64 x, CPUARMState *env)
return float64_to_float32(x, &env->vfp.fp_status);
}
+uint32_t HELPER(bfcvt)(float32 x, void *status)
+{
+ return float32_to_bfloat16(x, status);
+}
+
/*
* VFP3 fixed point conversion. The AArch32 versions of fix-to-float
* must always round-to-nearest; the AArch64 ones honour the FPSCR
--
2.25.1
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH v2 04/12] target/arm: Implement vector float32 to bfloat16 conversion
2021-05-25 22:58 [PATCH v2 00/12] target/arm: Implement BFloat16 Richard Henderson
` (2 preceding siblings ...)
2021-05-25 22:58 ` [PATCH v2 03/12] target/arm: Implement scalar float32 to bfloat16 conversion Richard Henderson
@ 2021-05-25 22:58 ` Richard Henderson
2021-05-25 22:58 ` [PATCH v2 05/12] softfpu: Add float_round_to_odd_inf Richard Henderson
` (8 subsequent siblings)
12 siblings, 0 replies; 14+ messages in thread
From: Richard Henderson @ 2021-05-25 22:58 UTC (permalink / raw)
To: qemu-devel; +Cc: Peter Maydell, qemu-arm
This is BFCVT{N,T} for both AArch64 AdvSIMD and SVE,
and VCVT.BF16.F32 for AArch32 NEON.
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/helper-sve.h | 4 ++++
target/arm/helper.h | 1 +
target/arm/neon-dp.decode | 1 +
target/arm/sve.decode | 2 ++
target/arm/sve_helper.c | 2 ++
target/arm/translate-a64.c | 17 ++++++++++++++
target/arm/translate-neon.c | 45 +++++++++++++++++++++++++++++++++++++
target/arm/translate-sve.c | 16 +++++++++++++
target/arm/vfp_helper.c | 7 ++++++
9 files changed, 95 insertions(+)
diff --git a/target/arm/helper-sve.h b/target/arm/helper-sve.h
index 29a14a21f5..dc629f851a 100644
--- a/target/arm/helper-sve.h
+++ b/target/arm/helper-sve.h
@@ -1197,6 +1197,8 @@ DEF_HELPER_FLAGS_5(sve_fcvt_hd, TCG_CALL_NO_RWG,
void, ptr, ptr, ptr, ptr, i32)
DEF_HELPER_FLAGS_5(sve_fcvt_sd, TCG_CALL_NO_RWG,
void, ptr, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_5(sve_bfcvt, TCG_CALL_NO_RWG,
+ void, ptr, ptr, ptr, ptr, i32)
DEF_HELPER_FLAGS_5(sve_fcvtzs_hh, TCG_CALL_NO_RWG,
void, ptr, ptr, ptr, ptr, i32)
@@ -2752,6 +2754,8 @@ DEF_HELPER_FLAGS_5(sve2_fcvtnt_sh, TCG_CALL_NO_RWG,
void, ptr, ptr, ptr, ptr, i32)
DEF_HELPER_FLAGS_5(sve2_fcvtnt_ds, TCG_CALL_NO_RWG,
void, ptr, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_5(sve_bfcvtnt, TCG_CALL_NO_RWG,
+ void, ptr, ptr, ptr, ptr, i32)
DEF_HELPER_FLAGS_5(sve2_fcvtlt_hs, TCG_CALL_NO_RWG,
void, ptr, ptr, ptr, ptr, i32)
diff --git a/target/arm/helper.h b/target/arm/helper.h
index 9977a827e9..8b4b7d92f3 100644
--- a/target/arm/helper.h
+++ b/target/arm/helper.h
@@ -144,6 +144,7 @@ DEF_HELPER_3(vfp_cmped, void, f64, f64, env)
DEF_HELPER_2(vfp_fcvtds, f64, f32, env)
DEF_HELPER_2(vfp_fcvtsd, f32, f64, env)
DEF_HELPER_FLAGS_2(bfcvt, TCG_CALL_NO_RWG, i32, f32, ptr)
+DEF_HELPER_FLAGS_2(bfcvt_pair, TCG_CALL_NO_RWG, i32, i64, ptr)
DEF_HELPER_2(vfp_uitoh, f16, i32, ptr)
DEF_HELPER_2(vfp_uitos, f32, i32, ptr)
diff --git a/target/arm/neon-dp.decode b/target/arm/neon-dp.decode
index ec83f10ab3..fd3a01bfa0 100644
--- a/target/arm/neon-dp.decode
+++ b/target/arm/neon-dp.decode
@@ -521,6 +521,7 @@ Vimm_1r 1111 001 . 1 . 000 ... .... cmode:4 0 . op:1 1 .... @1reg_imm
VRINTZ 1111 001 11 . 11 .. 10 .... 0 1011 . . 0 .... @2misc
VCVT_F16_F32 1111 001 11 . 11 .. 10 .... 0 1100 0 . 0 .... @2misc_q0
+ VCVT_B16_F32 1111 001 11 . 11 .. 10 .... 0 1100 1 . 0 .... @2misc_q0
VRINTM 1111 001 11 . 11 .. 10 .... 0 1101 . . 0 .... @2misc
diff --git a/target/arm/sve.decode b/target/arm/sve.decode
index cb077bfde9..18d1a0eecc 100644
--- a/target/arm/sve.decode
+++ b/target/arm/sve.decode
@@ -1036,6 +1036,7 @@ FNMLS_zpzzz 01100101 .. 1 ..... 111 ... ..... ..... @rdn_pg_rm_ra
# SVE floating-point convert precision
FCVT_sh 01100101 10 0010 00 101 ... ..... ..... @rd_pg_rn_e0
FCVT_hs 01100101 10 0010 01 101 ... ..... ..... @rd_pg_rn_e0
+BFCVT 01100101 10 0010 10 101 ... ..... ..... @rd_pg_rn_e0
FCVT_dh 01100101 11 0010 00 101 ... ..... ..... @rd_pg_rn_e0
FCVT_hd 01100101 11 0010 01 101 ... ..... ..... @rd_pg_rn_e0
FCVT_ds 01100101 11 0010 10 101 ... ..... ..... @rd_pg_rn_e0
@@ -1610,6 +1611,7 @@ RAX1 01000101 00 1 ..... 11110 1 ..... ..... @rd_rn_rm_e0
FCVTXNT_ds 01100100 00 0010 10 101 ... ..... ..... @rd_pg_rn_e0
FCVTX_ds 01100101 00 0010 10 101 ... ..... ..... @rd_pg_rn_e0
FCVTNT_sh 01100100 10 0010 00 101 ... ..... ..... @rd_pg_rn_e0
+BFCVTNT 01100100 10 0010 10 101 ... ..... ..... @rd_pg_rn_e0
FCVTLT_hs 01100100 10 0010 01 101 ... ..... ..... @rd_pg_rn_e0
FCVTNT_ds 01100100 11 0010 10 101 ... ..... ..... @rd_pg_rn_e0
FCVTLT_sd 01100100 11 0010 11 101 ... ..... ..... @rd_pg_rn_e0
diff --git a/target/arm/sve_helper.c b/target/arm/sve_helper.c
index 40af3024df..46a957b6fb 100644
--- a/target/arm/sve_helper.c
+++ b/target/arm/sve_helper.c
@@ -4708,6 +4708,7 @@ static inline uint64_t vfp_float64_to_uint64_rtz(float64 f, float_status *s)
DO_ZPZ_FP(sve_fcvt_sh, uint32_t, H1_4, sve_f32_to_f16)
DO_ZPZ_FP(sve_fcvt_hs, uint32_t, H1_4, sve_f16_to_f32)
+DO_ZPZ_FP(sve_bfcvt, uint32_t, H1_4, float32_to_bfloat16)
DO_ZPZ_FP(sve_fcvt_dh, uint64_t, , sve_f64_to_f16)
DO_ZPZ_FP(sve_fcvt_hd, uint64_t, , sve_f16_to_f64)
DO_ZPZ_FP(sve_fcvt_ds, uint64_t, , float64_to_float32)
@@ -7740,6 +7741,7 @@ void HELPER(NAME)(void *vd, void *vn, void *vg, void *status, uint32_t desc) \
} while (i != 0); \
}
+DO_FCVTNT(sve_bfcvtnt, uint32_t, uint16_t, H1_4, H1_2, float32_to_bfloat16)
DO_FCVTNT(sve2_fcvtnt_sh, uint32_t, uint16_t, H1_4, H1_2, sve_f32_to_f16)
DO_FCVTNT(sve2_fcvtnt_ds, uint64_t, uint32_t, , H1_4, float64_to_float32)
diff --git a/target/arm/translate-a64.c b/target/arm/translate-a64.c
index 90605d7dce..5a96523b9f 100644
--- a/target/arm/translate-a64.c
+++ b/target/arm/translate-a64.c
@@ -10346,6 +10346,13 @@ static void handle_2misc_narrow(DisasContext *s, bool scalar,
tcg_temp_free_i32(ahp);
}
break;
+ case 0x36: /* BFCVTN, BFCVTN2 */
+ {
+ TCGv_ptr fpst = fpstatus_ptr(FPST_FPCR);
+ gen_helper_bfcvt_pair(tcg_res[pass], tcg_op, fpst);
+ tcg_temp_free_ptr(fpst);
+ }
+ break;
case 0x56: /* FCVTXN, FCVTXN2 */
/* 64 bit to 32 bit float conversion
* with von Neumann rounding (round to odd)
@@ -12746,6 +12753,16 @@ static void disas_simd_two_reg_misc(DisasContext *s, uint32_t insn)
}
handle_2misc_narrow(s, false, opcode, 0, is_q, size - 1, rn, rd);
return;
+ case 0x36: /* BFCVTN, BFCVTN2 */
+ if (!dc_isar_feature(aa64_bf16, s) || size != 2) {
+ unallocated_encoding(s);
+ return;
+ }
+ if (!fp_access_check(s)) {
+ return;
+ }
+ handle_2misc_narrow(s, false, opcode, 0, is_q, size - 1, rn, rd);
+ return;
case 0x17: /* FCVTL, FCVTL2 */
if (!fp_access_check(s)) {
return;
diff --git a/target/arm/translate-neon.c b/target/arm/translate-neon.c
index 9e990b41ed..6d94229c69 100644
--- a/target/arm/translate-neon.c
+++ b/target/arm/translate-neon.c
@@ -3422,6 +3422,51 @@ static bool trans_VSHLL(DisasContext *s, arg_2misc *a)
return true;
}
+static bool trans_VCVT_B16_F32(DisasContext *s, arg_2misc *a)
+{
+ TCGv_ptr fpst;
+ TCGv_i64 tmp;
+ TCGv_i32 dst0, dst1;
+
+ if (!dc_isar_feature(aa32_bf16, s)) {
+ return false;
+ }
+
+ /* UNDEF accesses to D16-D31 if they don't exist. */
+ if (!dc_isar_feature(aa32_simd_r32, s) &&
+ ((a->vd | a->vm) & 0x10)) {
+ return false;
+ }
+
+ if ((a->vm & 1) || (a->size != 1)) {
+ return false;
+ }
+
+ if (!vfp_access_check(s)) {
+ return true;
+ }
+
+ fpst = fpstatus_ptr(FPST_STD);
+ tmp = tcg_temp_new_i64();
+ dst0 = tcg_temp_new_i32();
+ dst1 = tcg_temp_new_i32();
+
+ read_neon_element64(tmp, a->vm, 0, MO_64);
+ gen_helper_bfcvt_pair(dst0, tmp, fpst);
+
+ read_neon_element64(tmp, a->vm, 1, MO_64);
+ gen_helper_bfcvt_pair(dst1, tmp, fpst);
+
+ write_neon_element32(dst0, a->vd, 0, MO_32);
+ write_neon_element32(dst1, a->vd, 1, MO_32);
+
+ tcg_temp_free_i64(tmp);
+ tcg_temp_free_i32(dst0);
+ tcg_temp_free_i32(dst1);
+ tcg_temp_free_ptr(fpst);
+ return true;
+}
+
static bool trans_VCVT_F16_F32(DisasContext *s, arg_2misc *a)
{
TCGv_ptr fpst;
diff --git a/target/arm/translate-sve.c b/target/arm/translate-sve.c
index 9574efe957..fb692a1835 100644
--- a/target/arm/translate-sve.c
+++ b/target/arm/translate-sve.c
@@ -4777,6 +4777,14 @@ static bool trans_FCVT_hs(DisasContext *s, arg_rpr_esz *a)
return do_zpz_ptr(s, a->rd, a->rn, a->pg, false, gen_helper_sve_fcvt_hs);
}
+static bool trans_BFCVT(DisasContext *s, arg_rpr_esz *a)
+{
+ if (!dc_isar_feature(aa64_sve_bf16, s)) {
+ return false;
+ }
+ return do_zpz_ptr(s, a->rd, a->rn, a->pg, false, gen_helper_sve_bfcvt);
+}
+
static bool trans_FCVT_dh(DisasContext *s, arg_rpr_esz *a)
{
return do_zpz_ptr(s, a->rd, a->rn, a->pg, false, gen_helper_sve_fcvt_dh);
@@ -8472,6 +8480,14 @@ static bool trans_FCVTNT_sh(DisasContext *s, arg_rpr_esz *a)
return do_zpz_ptr(s, a->rd, a->rn, a->pg, false, gen_helper_sve2_fcvtnt_sh);
}
+static bool trans_BFCVTNT(DisasContext *s, arg_rpr_esz *a)
+{
+ if (!dc_isar_feature(aa64_sve_bf16, s)) {
+ return false;
+ }
+ return do_zpz_ptr(s, a->rd, a->rn, a->pg, false, gen_helper_sve_bfcvtnt);
+}
+
static bool trans_FCVTNT_ds(DisasContext *s, arg_rpr_esz *a)
{
if (!dc_isar_feature(aa64_sve2, s)) {
diff --git a/target/arm/vfp_helper.c b/target/arm/vfp_helper.c
index fe7a2a5daa..3328423cec 100644
--- a/target/arm/vfp_helper.c
+++ b/target/arm/vfp_helper.c
@@ -413,6 +413,13 @@ uint32_t HELPER(bfcvt)(float32 x, void *status)
return float32_to_bfloat16(x, status);
}
+uint32_t HELPER(bfcvt_pair)(uint64_t pair, void *status)
+{
+ bfloat16 lo = float32_to_bfloat16(extract64(pair, 0, 32), status);
+ bfloat16 hi = float32_to_bfloat16(extract64(pair, 32, 32), status);
+ return deposit32(lo, 16, 16, hi);
+}
+
/*
* VFP3 fixed point conversion. The AArch32 versions of fix-to-float
* must always round-to-nearest; the AArch64 ones honour the FPSCR
--
2.25.1
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH v2 05/12] softfpu: Add float_round_to_odd_inf
2021-05-25 22:58 [PATCH v2 00/12] target/arm: Implement BFloat16 Richard Henderson
` (3 preceding siblings ...)
2021-05-25 22:58 ` [PATCH v2 04/12] target/arm: Implement vector " Richard Henderson
@ 2021-05-25 22:58 ` Richard Henderson
2021-05-25 22:58 ` [PATCH v2 06/12] target/arm: Implement bfloat16 dot product (vector) Richard Henderson
` (7 subsequent siblings)
12 siblings, 0 replies; 14+ messages in thread
From: Richard Henderson @ 2021-05-25 22:58 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm, Alex Bennée
For Arm BFDOT and BFMMLA, we need a version of round-to-odd
that overflows to infinity, instead of the max normal number.
Cc: Alex Bennée <alex.bennee@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
include/fpu/softfloat-types.h | 4 +++-
fpu/softfloat-parts.c.inc | 6 ++++--
2 files changed, 7 insertions(+), 3 deletions(-)
diff --git a/include/fpu/softfloat-types.h b/include/fpu/softfloat-types.h
index 8a3f20fae9..3b757c3d6a 100644
--- a/include/fpu/softfloat-types.h
+++ b/include/fpu/softfloat-types.h
@@ -134,8 +134,10 @@ typedef enum __attribute__((__packed__)) {
float_round_up = 2,
float_round_to_zero = 3,
float_round_ties_away = 4,
- /* Not an IEEE rounding mode: round to the closest odd mantissa value */
+ /* Not an IEEE rounding mode: round to closest odd, overflow to max */
float_round_to_odd = 5,
+ /* Not an IEEE rounding mode: round to closest odd, overflow to inf */
+ float_round_to_odd_inf = 6,
} FloatRoundMode;
/*
diff --git a/fpu/softfloat-parts.c.inc b/fpu/softfloat-parts.c.inc
index a897a5a743..7f69da1d8f 100644
--- a/fpu/softfloat-parts.c.inc
+++ b/fpu/softfloat-parts.c.inc
@@ -176,13 +176,12 @@ static void partsN(uncanon)(FloatPartsN *p, float_status *s,
g_assert_not_reached();
}
+ overflow_norm = false;
switch (s->float_rounding_mode) {
case float_round_nearest_even:
- overflow_norm = false;
inc = ((p->frac_lo & roundeven_mask) != frac_lsbm1 ? frac_lsbm1 : 0);
break;
case float_round_ties_away:
- overflow_norm = false;
inc = frac_lsbm1;
break;
case float_round_to_zero:
@@ -199,6 +198,8 @@ static void partsN(uncanon)(FloatPartsN *p, float_status *s,
break;
case float_round_to_odd:
overflow_norm = true;
+ /* fall through */
+ case float_round_to_odd_inf:
inc = p->frac_lo & frac_lsb ? 0 : round_mask;
break;
default:
@@ -259,6 +260,7 @@ static void partsN(uncanon)(FloatPartsN *p, float_status *s,
? frac_lsbm1 : 0);
break;
case float_round_to_odd:
+ case float_round_to_odd_inf:
inc = p->frac_lo & frac_lsb ? 0 : round_mask;
break;
default:
--
2.25.1
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH v2 06/12] target/arm: Implement bfloat16 dot product (vector)
2021-05-25 22:58 [PATCH v2 00/12] target/arm: Implement BFloat16 Richard Henderson
` (4 preceding siblings ...)
2021-05-25 22:58 ` [PATCH v2 05/12] softfpu: Add float_round_to_odd_inf Richard Henderson
@ 2021-05-25 22:58 ` Richard Henderson
2021-05-25 22:58 ` [PATCH v2 07/12] target/arm: Implement bfloat16 dot product (indexed) Richard Henderson
` (6 subsequent siblings)
12 siblings, 0 replies; 14+ messages in thread
From: Richard Henderson @ 2021-05-25 22:58 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
This is BFDOT for both AArch64 AdvSIMD and SVE,
and VDOT.BF16 for AArch32 NEON.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/helper.h | 3 +++
target/arm/neon-shared.decode | 2 ++
target/arm/sve.decode | 3 +++
target/arm/translate-a64.c | 20 ++++++++++++++++++
target/arm/translate-neon.c | 9 ++++++++
target/arm/translate-sve.c | 12 +++++++++++
target/arm/vec_helper.c | 40 +++++++++++++++++++++++++++++++++++
7 files changed, 89 insertions(+)
diff --git a/target/arm/helper.h b/target/arm/helper.h
index 8b4b7d92f3..de2f5331dc 100644
--- a/target/arm/helper.h
+++ b/target/arm/helper.h
@@ -1002,6 +1002,9 @@ DEF_HELPER_FLAGS_5(gvec_ummla_b, TCG_CALL_NO_RWG,
DEF_HELPER_FLAGS_5(gvec_usmmla_b, TCG_CALL_NO_RWG,
void, ptr, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_5(gvec_bfdot, TCG_CALL_NO_RWG,
+ void, ptr, ptr, ptr, ptr, i32)
+
#ifdef TARGET_AARCH64
#include "helper-a64.h"
#include "helper-sve.h"
diff --git a/target/arm/neon-shared.decode b/target/arm/neon-shared.decode
index cc9f4cdd85..31a0839bbb 100644
--- a/target/arm/neon-shared.decode
+++ b/target/arm/neon-shared.decode
@@ -52,6 +52,8 @@ VUDOT 1111 110 00 . 10 .... .... 1101 . q:1 . 1 .... \
vm=%vm_dp vn=%vn_dp vd=%vd_dp
VUSDOT 1111 110 01 . 10 .... .... 1101 . q:1 . 0 .... \
vm=%vm_dp vn=%vn_dp vd=%vd_dp
+VDOT_b16 1111 110 00 . 00 .... .... 1101 . q:1 . 0 .... \
+ vm=%vm_dp vn=%vn_dp vd=%vd_dp
# VFM[AS]L
VFML 1111 110 0 s:1 . 10 .... .... 1000 . 0 . 1 .... \
diff --git a/target/arm/sve.decode b/target/arm/sve.decode
index 18d1a0eecc..a7429b293f 100644
--- a/target/arm/sve.decode
+++ b/target/arm/sve.decode
@@ -1625,6 +1625,9 @@ FMLALT_zzzw 01100100 10 1 ..... 10 0 00 1 ..... ..... @rda_rn_rm_e0
FMLSLB_zzzw 01100100 10 1 ..... 10 1 00 0 ..... ..... @rda_rn_rm_e0
FMLSLT_zzzw 01100100 10 1 ..... 10 1 00 1 ..... ..... @rda_rn_rm_e0
+### SVE2 floating-point bfloat16 dot-product
+BFDOT_zzzz 01100100 01 1 ..... 10 0 00 0 ..... ..... @rda_rn_rm_e0
+
### SVE2 floating-point multiply-add long (indexed)
FMLALB_zzxw 01100100 10 1 ..... 0100.0 ..... ..... @rrxr_3a esz=2
FMLALT_zzxw 01100100 10 1 ..... 0100.1 ..... ..... @rrxr_3a esz=2
diff --git a/target/arm/translate-a64.c b/target/arm/translate-a64.c
index 5a96523b9f..127c8d8e9d 100644
--- a/target/arm/translate-a64.c
+++ b/target/arm/translate-a64.c
@@ -12228,6 +12228,16 @@ static void disas_simd_three_reg_same_extra(DisasContext *s, uint32_t insn)
}
feature = dc_isar_feature(aa64_fcma, s);
break;
+ case 0x1f: /* BFDOT */
+ switch (size) {
+ case 1:
+ feature = dc_isar_feature(aa64_bf16, s);
+ break;
+ default:
+ unallocated_encoding(s);
+ return;
+ }
+ break;
default:
unallocated_encoding(s);
return;
@@ -12311,6 +12321,16 @@ static void disas_simd_three_reg_same_extra(DisasContext *s, uint32_t insn)
}
return;
+ case 0xf: /* BFDOT */
+ switch (size) {
+ case 1:
+ gen_gvec_op4_ool(s, is_q, rd, rn, rm, rd, 0, gen_helper_gvec_bfdot);
+ break;
+ default:
+ g_assert_not_reached();
+ }
+ return;
+
default:
g_assert_not_reached();
}
diff --git a/target/arm/translate-neon.c b/target/arm/translate-neon.c
index 6d94229c69..9460857b2a 100644
--- a/target/arm/translate-neon.c
+++ b/target/arm/translate-neon.c
@@ -296,6 +296,15 @@ static bool trans_VUSDOT(DisasContext *s, arg_VUSDOT *a)
gen_helper_gvec_usdot_b);
}
+static bool trans_VDOT_b16(DisasContext *s, arg_VDOT_b16 *a)
+{
+ if (!dc_isar_feature(aa32_bf16, s)) {
+ return false;
+ }
+ return do_neon_ddda(s, a->q * 7, a->vd, a->vn, a->vm, 0,
+ gen_helper_gvec_bfdot);
+}
+
static bool trans_VFML(DisasContext *s, arg_VFML *a)
{
int opr_sz;
diff --git a/target/arm/translate-sve.c b/target/arm/translate-sve.c
index fb692a1835..ed290827ad 100644
--- a/target/arm/translate-sve.c
+++ b/target/arm/translate-sve.c
@@ -8653,3 +8653,15 @@ static bool trans_UMMLA(DisasContext *s, arg_rrrr_esz *a)
{
return do_i8mm_zzzz_ool(s, a, gen_helper_gvec_ummla_b, 0);
}
+
+static bool trans_BFDOT_zzzz(DisasContext *s, arg_rrrr_esz *a)
+{
+ if (!dc_isar_feature(aa64_sve_bf16, s)) {
+ return false;
+ }
+ if (sve_access_check(s)) {
+ gen_gvec_ool_zzzz(s, gen_helper_gvec_bfdot,
+ a->rd, a->rn, a->rm, a->ra, 0);
+ }
+ return true;
+}
diff --git a/target/arm/vec_helper.c b/target/arm/vec_helper.c
index e84b438340..7eefcd06ea 100644
--- a/target/arm/vec_helper.c
+++ b/target/arm/vec_helper.c
@@ -2412,3 +2412,43 @@ static void do_mmla_b(void *vd, void *vn, void *vm, void *va, uint32_t desc,
DO_MMLA_B(gvec_smmla_b, do_smmla_b)
DO_MMLA_B(gvec_ummla_b, do_ummla_b)
DO_MMLA_B(gvec_usmmla_b, do_usmmla_b)
+
+/*
+ * BFloat16 Dot Product
+ */
+
+static float32 bfdotadd(float32 sum, uint32_t e1, uint32_t e2)
+{
+ /* FPCR is ignored for BFDOT and BFMMLA. */
+ float_status bf_status = {
+ .tininess_before_rounding = float_tininess_before_rounding,
+ .float_rounding_mode = float_round_to_odd_inf,
+ .flush_to_zero = true,
+ .flush_inputs_to_zero = true,
+ .default_nan_mode = true,
+ };
+ float32 t1, t2;
+
+ /*
+ * Extract each BFloat16 from the element pair, and shift
+ * them such that they become float32.
+ */
+ t1 = float32_mul(e1 << 16, e2 << 16, &bf_status);
+ t2 = float32_mul(e1 & 0xffff0000u, e2 & 0xffff0000u, &bf_status);
+ t1 = float32_add(t1, t2, &bf_status);
+ t1 = float32_add(sum, t1, &bf_status);
+
+ return t1;
+}
+
+void HELPER(gvec_bfdot)(void *vd, void *vn, void *vm, void *va, uint32_t desc)
+{
+ intptr_t i, opr_sz = simd_oprsz(desc);
+ float32 *d = vd, *a = va;
+ uint32_t *n = vn, *m = vm;
+
+ for (i = 0; i < opr_sz / 4; ++i) {
+ d[i] = bfdotadd(a[i], n[i], m[i]);
+ }
+ clear_tail(d, opr_sz, simd_maxsz(desc));
+}
--
2.25.1
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH v2 07/12] target/arm: Implement bfloat16 dot product (indexed)
2021-05-25 22:58 [PATCH v2 00/12] target/arm: Implement BFloat16 Richard Henderson
` (5 preceding siblings ...)
2021-05-25 22:58 ` [PATCH v2 06/12] target/arm: Implement bfloat16 dot product (vector) Richard Henderson
@ 2021-05-25 22:58 ` Richard Henderson
2021-05-25 22:58 ` [PATCH v2 08/12] target/arm: Implement bfloat16 matrix multiply accumulate Richard Henderson
` (5 subsequent siblings)
12 siblings, 0 replies; 14+ messages in thread
From: Richard Henderson @ 2021-05-25 22:58 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
This is BFDOT for both AArch64 AdvSIMD and SVE,
and VDOT.BF16 for AArch32 NEON.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/helper.h | 2 ++
target/arm/neon-shared.decode | 2 ++
target/arm/sve.decode | 3 +++
target/arm/translate-a64.c | 41 +++++++++++++++++++++++++++--------
target/arm/translate-neon.c | 9 ++++++++
target/arm/translate-sve.c | 12 ++++++++++
target/arm/vec_helper.c | 20 +++++++++++++++++
7 files changed, 80 insertions(+), 9 deletions(-)
diff --git a/target/arm/helper.h b/target/arm/helper.h
index de2f5331dc..376c1cef0f 100644
--- a/target/arm/helper.h
+++ b/target/arm/helper.h
@@ -1004,6 +1004,8 @@ DEF_HELPER_FLAGS_5(gvec_usmmla_b, TCG_CALL_NO_RWG,
DEF_HELPER_FLAGS_5(gvec_bfdot, TCG_CALL_NO_RWG,
void, ptr, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_5(gvec_bfdot_idx, TCG_CALL_NO_RWG,
+ void, ptr, ptr, ptr, ptr, i32)
#ifdef TARGET_AARCH64
#include "helper-a64.h"
diff --git a/target/arm/neon-shared.decode b/target/arm/neon-shared.decode
index 31a0839bbb..fa3cf14e3a 100644
--- a/target/arm/neon-shared.decode
+++ b/target/arm/neon-shared.decode
@@ -81,6 +81,8 @@ VUSDOT_scalar 1111 1110 1 . 00 .... .... 1101 . q:1 index:1 0 vm:4 \
vn=%vn_dp vd=%vd_dp
VSUDOT_scalar 1111 1110 1 . 00 .... .... 1101 . q:1 index:1 1 vm:4 \
vn=%vn_dp vd=%vd_dp
+VDOT_b16_scal 1111 1110 0 . 00 .... .... 1101 . q:1 index:1 0 vm:4 \
+ vn=%vn_dp vd=%vd_dp
%vfml_scalar_q0_rm 0:3 5:1
%vfml_scalar_q1_index 5:1 3:1
diff --git a/target/arm/sve.decode b/target/arm/sve.decode
index a7429b293f..51f87e8937 100644
--- a/target/arm/sve.decode
+++ b/target/arm/sve.decode
@@ -1633,3 +1633,6 @@ FMLALB_zzxw 01100100 10 1 ..... 0100.0 ..... ..... @rrxr_3a esz=2
FMLALT_zzxw 01100100 10 1 ..... 0100.1 ..... ..... @rrxr_3a esz=2
FMLSLB_zzxw 01100100 10 1 ..... 0110.0 ..... ..... @rrxr_3a esz=2
FMLSLT_zzxw 01100100 10 1 ..... 0110.1 ..... ..... @rrxr_3a esz=2
+
+### SVE2 floating-point bfloat16 dot-product (indexed)
+BFDOT_zzxz 01100100 01 1 ..... 010000 ..... ..... @rrxr_2 esz=2
diff --git a/target/arm/translate-a64.c b/target/arm/translate-a64.c
index 127c8d8e9d..1df931cef5 100644
--- a/target/arm/translate-a64.c
+++ b/target/arm/translate-a64.c
@@ -13442,8 +13442,22 @@ static void disas_simd_indexed(DisasContext *s, uint32_t insn)
return;
}
break;
- case 0x0f: /* SUDOT, USDOT */
- if (is_scalar || (size & 1) || !dc_isar_feature(aa64_i8mm, s)) {
+ case 0x0f:
+ switch (size) {
+ case 0: /* SUDOT */
+ case 2: /* USDOT */
+ if (is_scalar || !dc_isar_feature(aa64_i8mm, s)) {
+ unallocated_encoding(s);
+ return;
+ }
+ break;
+ case 1: /* BFDOT */
+ if (is_scalar || !dc_isar_feature(aa64_bf16, s)) {
+ unallocated_encoding(s);
+ return;
+ }
+ break;
+ default:
unallocated_encoding(s);
return;
}
@@ -13563,13 +13577,22 @@ static void disas_simd_indexed(DisasContext *s, uint32_t insn)
u ? gen_helper_gvec_udot_idx_b
: gen_helper_gvec_sdot_idx_b);
return;
- case 0x0f: /* SUDOT, USDOT */
- gen_gvec_op4_ool(s, is_q, rd, rn, rm, rd, index,
- extract32(insn, 23, 1)
- ? gen_helper_gvec_usdot_idx_b
- : gen_helper_gvec_sudot_idx_b);
- return;
-
+ case 0x0f:
+ switch (extract32(insn, 22, 2)) {
+ case 0: /* SUDOT */
+ gen_gvec_op4_ool(s, is_q, rd, rn, rm, rd, index,
+ gen_helper_gvec_sudot_idx_b);
+ return;
+ case 1: /* BFDOT */
+ gen_gvec_op4_ool(s, is_q, rd, rn, rm, rd, index,
+ gen_helper_gvec_bfdot_idx);
+ return;
+ case 2: /* USDOT */
+ gen_gvec_op4_ool(s, is_q, rd, rn, rm, rd, index,
+ gen_helper_gvec_usdot_idx_b);
+ return;
+ }
+ g_assert_not_reached();
case 0x11: /* FCMLA #0 */
case 0x13: /* FCMLA #90 */
case 0x15: /* FCMLA #180 */
diff --git a/target/arm/translate-neon.c b/target/arm/translate-neon.c
index 9460857b2a..8099767792 100644
--- a/target/arm/translate-neon.c
+++ b/target/arm/translate-neon.c
@@ -390,6 +390,15 @@ static bool trans_VSUDOT_scalar(DisasContext *s, arg_VSUDOT_scalar *a)
gen_helper_gvec_sudot_idx_b);
}
+static bool trans_VDOT_b16_scal(DisasContext *s, arg_VDOT_b16_scal *a)
+{
+ if (!dc_isar_feature(aa32_bf16, s)) {
+ return false;
+ }
+ return do_neon_ddda(s, a->q * 6, a->vd, a->vn, a->vm, a->index,
+ gen_helper_gvec_bfdot_idx);
+}
+
static bool trans_VFML_scalar(DisasContext *s, arg_VFML_scalar *a)
{
int opr_sz;
diff --git a/target/arm/translate-sve.c b/target/arm/translate-sve.c
index ed290827ad..6f02030635 100644
--- a/target/arm/translate-sve.c
+++ b/target/arm/translate-sve.c
@@ -8665,3 +8665,15 @@ static bool trans_BFDOT_zzzz(DisasContext *s, arg_rrrr_esz *a)
}
return true;
}
+
+static bool trans_BFDOT_zzxz(DisasContext *s, arg_rrxr_esz *a)
+{
+ if (!dc_isar_feature(aa64_sve_bf16, s)) {
+ return false;
+ }
+ if (sve_access_check(s)) {
+ gen_gvec_ool_zzzz(s, gen_helper_gvec_bfdot_idx,
+ a->rd, a->rn, a->rm, a->ra, a->index);
+ }
+ return true;
+}
diff --git a/target/arm/vec_helper.c b/target/arm/vec_helper.c
index 7eefcd06ea..74a497f38c 100644
--- a/target/arm/vec_helper.c
+++ b/target/arm/vec_helper.c
@@ -2452,3 +2452,23 @@ void HELPER(gvec_bfdot)(void *vd, void *vn, void *vm, void *va, uint32_t desc)
}
clear_tail(d, opr_sz, simd_maxsz(desc));
}
+
+void HELPER(gvec_bfdot_idx)(void *vd, void *vn, void *vm,
+ void *va, uint32_t desc)
+{
+ intptr_t i, j, opr_sz = simd_oprsz(desc);
+ intptr_t index = simd_data(desc);
+ intptr_t elements = opr_sz / 4;
+ intptr_t eltspersegment = MIN(16 / 4, elements);
+ float32 *d = vd, *a = va;
+ uint32_t *n = vn, *m = vm;
+
+ for (i = 0; i < elements; i += eltspersegment) {
+ uint32_t m_idx = m[i + H4(index)];
+
+ for (j = i; j < i + eltspersegment; j++) {
+ d[j] = bfdotadd(a[j], n[j], m_idx);
+ }
+ }
+ clear_tail(d, opr_sz, simd_maxsz(desc));
+}
--
2.25.1
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH v2 08/12] target/arm: Implement bfloat16 matrix multiply accumulate
2021-05-25 22:58 [PATCH v2 00/12] target/arm: Implement BFloat16 Richard Henderson
` (6 preceding siblings ...)
2021-05-25 22:58 ` [PATCH v2 07/12] target/arm: Implement bfloat16 dot product (indexed) Richard Henderson
@ 2021-05-25 22:58 ` Richard Henderson
2021-05-25 22:58 ` [PATCH v2 09/12] target/arm: Implement bfloat widening fma (vector) Richard Henderson
` (4 subsequent siblings)
12 siblings, 0 replies; 14+ messages in thread
From: Richard Henderson @ 2021-05-25 22:58 UTC (permalink / raw)
To: qemu-devel; +Cc: Peter Maydell, qemu-arm
This is BFMMLA for both AArch64 AdvSIMD and SVE,
and VMMLA.BF16 for AArch32 NEON.
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/helper.h | 3 +++
target/arm/neon-shared.decode | 2 ++
target/arm/sve.decode | 6 +++--
target/arm/translate-a64.c | 10 +++++++++
target/arm/translate-neon.c | 9 ++++++++
target/arm/translate-sve.c | 12 ++++++++++
target/arm/vec_helper.c | 42 ++++++++++++++++++++++++++++++++++-
7 files changed, 81 insertions(+), 3 deletions(-)
diff --git a/target/arm/helper.h b/target/arm/helper.h
index 376c1cef0f..af75d7f25f 100644
--- a/target/arm/helper.h
+++ b/target/arm/helper.h
@@ -1007,6 +1007,9 @@ DEF_HELPER_FLAGS_5(gvec_bfdot, TCG_CALL_NO_RWG,
DEF_HELPER_FLAGS_5(gvec_bfdot_idx, TCG_CALL_NO_RWG,
void, ptr, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_5(gvec_bfmmla, TCG_CALL_NO_RWG,
+ void, ptr, ptr, ptr, ptr, i32)
+
#ifdef TARGET_AARCH64
#include "helper-a64.h"
#include "helper-sve.h"
diff --git a/target/arm/neon-shared.decode b/target/arm/neon-shared.decode
index fa3cf14e3a..4e0a25d27c 100644
--- a/target/arm/neon-shared.decode
+++ b/target/arm/neon-shared.decode
@@ -67,6 +67,8 @@ VUMMLA 1111 1100 0.10 .... .... 1100 .1.1 .... \
vm=%vm_dp vn=%vn_dp vd=%vd_dp
VUSMMLA 1111 1100 1.10 .... .... 1100 .1.0 .... \
vm=%vm_dp vn=%vn_dp vd=%vd_dp
+VMMLA_b16 1111 1100 0.00 .... .... 1100 .1.0 .... \
+ vm=%vm_dp vn=%vn_dp vd=%vd_dp
VCMLA_scalar 1111 1110 0 . rot:2 .... .... 1000 . q:1 index:1 0 vm:4 \
vn=%vn_dp vd=%vd_dp size=1
diff --git a/target/arm/sve.decode b/target/arm/sve.decode
index 51f87e8937..6c17898dee 100644
--- a/target/arm/sve.decode
+++ b/target/arm/sve.decode
@@ -1568,8 +1568,10 @@ SQRDCMLAH_zzzz 01000100 esz:2 0 rm:5 0011 rot:2 rn:5 rd:5 ra=%reg_movprfx
USDOT_zzzz 01000100 .. 0 ..... 011 110 ..... ..... @rda_rn_rm
### SVE2 floating point matrix multiply accumulate
-
-FMMLA 01100100 .. 1 ..... 111001 ..... ..... @rda_rn_rm
+{
+ BFMMLA 01100100 01 1 ..... 111 001 ..... ..... @rda_rn_rm_e0
+ FMMLA 01100100 .. 1 ..... 111 001 ..... ..... @rda_rn_rm
+}
### SVE2 Memory Gather Load Group
diff --git a/target/arm/translate-a64.c b/target/arm/translate-a64.c
index 1df931cef5..9f8dae90ba 100644
--- a/target/arm/translate-a64.c
+++ b/target/arm/translate-a64.c
@@ -12228,6 +12228,13 @@ static void disas_simd_three_reg_same_extra(DisasContext *s, uint32_t insn)
}
feature = dc_isar_feature(aa64_fcma, s);
break;
+ case 0x1d: /* BFMMLA */
+ if (size != MO_16 || !is_q) {
+ unallocated_encoding(s);
+ return;
+ }
+ feature = dc_isar_feature(aa64_bf16, s);
+ break;
case 0x1f: /* BFDOT */
switch (size) {
case 1:
@@ -12321,6 +12328,9 @@ static void disas_simd_three_reg_same_extra(DisasContext *s, uint32_t insn)
}
return;
+ case 0xd: /* BFMMLA */
+ gen_gvec_op4_ool(s, is_q, rd, rn, rm, rd, 0, gen_helper_gvec_bfmmla);
+ return;
case 0xf: /* BFDOT */
switch (size) {
case 1:
diff --git a/target/arm/translate-neon.c b/target/arm/translate-neon.c
index 8099767792..9d227a1e13 100644
--- a/target/arm/translate-neon.c
+++ b/target/arm/translate-neon.c
@@ -4126,3 +4126,12 @@ static bool trans_VUSMMLA(DisasContext *s, arg_VUSMMLA *a)
return do_neon_ddda(s, 7, a->vd, a->vn, a->vm, 0,
gen_helper_gvec_usmmla_b);
}
+
+static bool trans_VMMLA_b16(DisasContext *s, arg_VMMLA_b16 *a)
+{
+ if (!dc_isar_feature(aa32_bf16, s)) {
+ return false;
+ }
+ return do_neon_ddda(s, 7, a->vd, a->vn, a->vm, 0,
+ gen_helper_gvec_bfmmla);
+}
diff --git a/target/arm/translate-sve.c b/target/arm/translate-sve.c
index 6f02030635..4f575dc334 100644
--- a/target/arm/translate-sve.c
+++ b/target/arm/translate-sve.c
@@ -8677,3 +8677,15 @@ static bool trans_BFDOT_zzxz(DisasContext *s, arg_rrxr_esz *a)
}
return true;
}
+
+static bool trans_BFMMLA(DisasContext *s, arg_rrrr_esz *a)
+{
+ if (!dc_isar_feature(aa64_sve_bf16, s)) {
+ return false;
+ }
+ if (sve_access_check(s)) {
+ gen_gvec_ool_zzzz(s, gen_helper_gvec_bfmmla,
+ a->rd, a->rn, a->rm, a->ra, 0);
+ }
+ return true;
+}
diff --git a/target/arm/vec_helper.c b/target/arm/vec_helper.c
index 74a497f38c..27e9bdd329 100644
--- a/target/arm/vec_helper.c
+++ b/target/arm/vec_helper.c
@@ -2385,7 +2385,7 @@ static void do_mmla_b(void *vd, void *vn, void *vm, void *va, uint32_t desc,
* Process the entire segment at once, writing back the
* results only after we've consumed all of the inputs.
*
- * Key to indicies by column:
+ * Key to indices by column:
* i j i j
*/
sum0 = a[H4(0 + 0)];
@@ -2472,3 +2472,43 @@ void HELPER(gvec_bfdot_idx)(void *vd, void *vn, void *vm,
}
clear_tail(d, opr_sz, simd_maxsz(desc));
}
+
+void HELPER(gvec_bfmmla)(void *vd, void *vn, void *vm, void *va, uint32_t desc)
+{
+ intptr_t s, opr_sz = simd_oprsz(desc);
+ float32 *d = vd, *a = va;
+ uint32_t *n = vn, *m = vm;
+
+ for (s = 0; s < opr_sz / 4; s += 4) {
+ float32 sum00, sum01, sum10, sum11;
+
+ /*
+ * Process the entire segment at once, writing back the
+ * results only after we've consumed all of the inputs.
+ *
+ * Key to indicies by column:
+ * i j i k j k
+ */
+ sum00 = a[s + H4(0 + 0)];
+ sum00 = bfdotadd(sum00, n[s + H4(0 + 0)], m[s + H4(0 + 0)]);
+ sum00 = bfdotadd(sum00, n[s + H4(0 + 1)], m[s + H4(0 + 1)]);
+
+ sum01 = a[s + H4(0 + 1)];
+ sum01 = bfdotadd(sum01, n[s + H4(0 + 0)], m[s + H4(2 + 0)]);
+ sum01 = bfdotadd(sum01, n[s + H4(0 + 1)], m[s + H4(2 + 1)]);
+
+ sum10 = a[s + H4(2 + 0)];
+ sum10 = bfdotadd(sum10, n[s + H4(2 + 0)], m[s + H4(0 + 0)]);
+ sum10 = bfdotadd(sum10, n[s + H4(2 + 1)], m[s + H4(0 + 1)]);
+
+ sum11 = a[s + H4(2 + 1)];
+ sum11 = bfdotadd(sum11, n[s + H4(2 + 0)], m[s + H4(2 + 0)]);
+ sum11 = bfdotadd(sum11, n[s + H4(2 + 1)], m[s + H4(2 + 1)]);
+
+ d[s + H4(0 + 0)] = sum00;
+ d[s + H4(0 + 1)] = sum01;
+ d[s + H4(2 + 0)] = sum10;
+ d[s + H4(2 + 1)] = sum11;
+ }
+ clear_tail(d, opr_sz, simd_maxsz(desc));
+}
--
2.25.1
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH v2 09/12] target/arm: Implement bfloat widening fma (vector)
2021-05-25 22:58 [PATCH v2 00/12] target/arm: Implement BFloat16 Richard Henderson
` (7 preceding siblings ...)
2021-05-25 22:58 ` [PATCH v2 08/12] target/arm: Implement bfloat16 matrix multiply accumulate Richard Henderson
@ 2021-05-25 22:58 ` Richard Henderson
2021-05-25 22:58 ` [PATCH v2 10/12] target/arm: Implement bfloat widening fma (indexed) Richard Henderson
` (3 subsequent siblings)
12 siblings, 0 replies; 14+ messages in thread
From: Richard Henderson @ 2021-05-25 22:58 UTC (permalink / raw)
To: qemu-devel; +Cc: Peter Maydell, qemu-arm
This is BFMLAL{B,T} for both AArch64 AdvSIMD and SVE,
and VFMA{B,T}.BF16 for AArch32 NEON.
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/helper.h | 3 +++
target/arm/neon-shared.decode | 3 +++
target/arm/sve.decode | 3 +++
target/arm/translate-a64.c | 13 +++++++++----
target/arm/translate-neon.c | 9 +++++++++
target/arm/translate-sve.c | 30 ++++++++++++++++++++++++++++++
target/arm/vec_helper.c | 16 ++++++++++++++++
7 files changed, 73 insertions(+), 4 deletions(-)
diff --git a/target/arm/helper.h b/target/arm/helper.h
index af75d7f25f..36b3c9dd2d 100644
--- a/target/arm/helper.h
+++ b/target/arm/helper.h
@@ -1010,6 +1010,9 @@ DEF_HELPER_FLAGS_5(gvec_bfdot_idx, TCG_CALL_NO_RWG,
DEF_HELPER_FLAGS_5(gvec_bfmmla, TCG_CALL_NO_RWG,
void, ptr, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_6(gvec_bfmlal, TCG_CALL_NO_RWG,
+ void, ptr, ptr, ptr, ptr, ptr, i32)
+
#ifdef TARGET_AARCH64
#include "helper-a64.h"
#include "helper-sve.h"
diff --git a/target/arm/neon-shared.decode b/target/arm/neon-shared.decode
index 4e0a25d27c..b61addd98b 100644
--- a/target/arm/neon-shared.decode
+++ b/target/arm/neon-shared.decode
@@ -70,6 +70,9 @@ VUSMMLA 1111 1100 1.10 .... .... 1100 .1.0 .... \
VMMLA_b16 1111 1100 0.00 .... .... 1100 .1.0 .... \
vm=%vm_dp vn=%vn_dp vd=%vd_dp
+VFMA_b16 1111 110 0 0.11 .... .... 1000 . q:1 . 1 .... \
+ vm=%vm_dp vn=%vn_dp vd=%vd_dp
+
VCMLA_scalar 1111 1110 0 . rot:2 .... .... 1000 . q:1 index:1 0 vm:4 \
vn=%vn_dp vd=%vd_dp size=1
VCMLA_scalar 1111 1110 1 . rot:2 .... .... 1000 . q:1 . 0 .... \
diff --git a/target/arm/sve.decode b/target/arm/sve.decode
index 6c17898dee..5281164eea 100644
--- a/target/arm/sve.decode
+++ b/target/arm/sve.decode
@@ -1627,6 +1627,9 @@ FMLALT_zzzw 01100100 10 1 ..... 10 0 00 1 ..... ..... @rda_rn_rm_e0
FMLSLB_zzzw 01100100 10 1 ..... 10 1 00 0 ..... ..... @rda_rn_rm_e0
FMLSLT_zzzw 01100100 10 1 ..... 10 1 00 1 ..... ..... @rda_rn_rm_e0
+BFMLALB_zzzw 01100100 11 1 ..... 10 0 00 0 ..... ..... @rda_rn_rm_e0
+BFMLALT_zzzw 01100100 11 1 ..... 10 0 00 1 ..... ..... @rda_rn_rm_e0
+
### SVE2 floating-point bfloat16 dot-product
BFDOT_zzzz 01100100 01 1 ..... 10 0 00 0 ..... ..... @rda_rn_rm_e0
diff --git a/target/arm/translate-a64.c b/target/arm/translate-a64.c
index 9f8dae90ba..2a99e015ca 100644
--- a/target/arm/translate-a64.c
+++ b/target/arm/translate-a64.c
@@ -12235,9 +12235,10 @@ static void disas_simd_three_reg_same_extra(DisasContext *s, uint32_t insn)
}
feature = dc_isar_feature(aa64_bf16, s);
break;
- case 0x1f: /* BFDOT */
+ case 0x1f:
switch (size) {
- case 1:
+ case 1: /* BFDOT */
+ case 3: /* BFMLAL{B,T} */
feature = dc_isar_feature(aa64_bf16, s);
break;
default:
@@ -12331,11 +12332,15 @@ static void disas_simd_three_reg_same_extra(DisasContext *s, uint32_t insn)
case 0xd: /* BFMMLA */
gen_gvec_op4_ool(s, is_q, rd, rn, rm, rd, 0, gen_helper_gvec_bfmmla);
return;
- case 0xf: /* BFDOT */
+ case 0xf:
switch (size) {
- case 1:
+ case 1: /* BFDOT */
gen_gvec_op4_ool(s, is_q, rd, rn, rm, rd, 0, gen_helper_gvec_bfdot);
break;
+ case 3: /* BFMLAL{B,T} */
+ gen_gvec_op4_fpst(s, 1, rd, rn, rm, rd, false, is_q,
+ gen_helper_gvec_bfmlal);
+ break;
default:
g_assert_not_reached();
}
diff --git a/target/arm/translate-neon.c b/target/arm/translate-neon.c
index 9d227a1e13..4d0c2494dc 100644
--- a/target/arm/translate-neon.c
+++ b/target/arm/translate-neon.c
@@ -4135,3 +4135,12 @@ static bool trans_VMMLA_b16(DisasContext *s, arg_VMMLA_b16 *a)
return do_neon_ddda(s, 7, a->vd, a->vn, a->vm, 0,
gen_helper_gvec_bfmmla);
}
+
+static bool trans_VFMA_b16(DisasContext *s, arg_VFMA_b16 *a)
+{
+ if (!dc_isar_feature(aa32_bf16, s)) {
+ return false;
+ }
+ return do_neon_ddda_fpst(s, 7, a->vd, a->vn, a->vm, a->q, FPST_STD,
+ gen_helper_gvec_bfmlal);
+}
diff --git a/target/arm/translate-sve.c b/target/arm/translate-sve.c
index 4f575dc334..ba8f5d7b7d 100644
--- a/target/arm/translate-sve.c
+++ b/target/arm/translate-sve.c
@@ -8689,3 +8689,33 @@ static bool trans_BFMMLA(DisasContext *s, arg_rrrr_esz *a)
}
return true;
}
+
+static bool do_BFMLAL_zzzw(DisasContext *s, arg_rrrr_esz *a, bool sel)
+{
+ if (!dc_isar_feature(aa64_sve_bf16, s)) {
+ return false;
+ }
+ if (sve_access_check(s)) {
+ TCGv_ptr status = fpstatus_ptr(FPST_FPCR);
+ unsigned vsz = vec_full_reg_size(s);
+
+ tcg_gen_gvec_4_ptr(vec_full_reg_offset(s, a->rd),
+ vec_full_reg_offset(s, a->rn),
+ vec_full_reg_offset(s, a->rm),
+ vec_full_reg_offset(s, a->ra),
+ status, vsz, vsz, sel,
+ gen_helper_gvec_bfmlal);
+ tcg_temp_free_ptr(status);
+ }
+ return true;
+}
+
+static bool trans_BFMLALB_zzzw(DisasContext *s, arg_rrrr_esz *a)
+{
+ return do_BFMLAL_zzzw(s, a, false);
+}
+
+static bool trans_BFMLALT_zzzw(DisasContext *s, arg_rrrr_esz *a)
+{
+ return do_BFMLAL_zzzw(s, a, true);
+}
diff --git a/target/arm/vec_helper.c b/target/arm/vec_helper.c
index 27e9bdd329..d82736b5e6 100644
--- a/target/arm/vec_helper.c
+++ b/target/arm/vec_helper.c
@@ -2512,3 +2512,19 @@ void HELPER(gvec_bfmmla)(void *vd, void *vn, void *vm, void *va, uint32_t desc)
}
clear_tail(d, opr_sz, simd_maxsz(desc));
}
+
+void HELPER(gvec_bfmlal)(void *vd, void *vn, void *vm, void *va,
+ void *stat, uint32_t desc)
+{
+ intptr_t i, opr_sz = simd_oprsz(desc);
+ intptr_t sel = simd_data(desc);
+ float32 *d = vd, *a = va;
+ bfloat16 *n = vn, *m = vm;
+
+ for (i = 0; i < opr_sz / 4; ++i) {
+ float32 nn = n[H2(i * 2 + sel)] << 16;
+ float32 mm = m[H2(i * 2 + sel)] << 16;
+ d[H4(i)] = float32_muladd(nn, mm, a[H4(i)], 0, stat);
+ }
+ clear_tail(d, opr_sz, simd_maxsz(desc));
+}
--
2.25.1
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH v2 10/12] target/arm: Implement bfloat widening fma (indexed)
2021-05-25 22:58 [PATCH v2 00/12] target/arm: Implement BFloat16 Richard Henderson
` (8 preceding siblings ...)
2021-05-25 22:58 ` [PATCH v2 09/12] target/arm: Implement bfloat widening fma (vector) Richard Henderson
@ 2021-05-25 22:58 ` Richard Henderson
2021-05-25 22:58 ` [PATCH v2 11/12] linux-user/aarch64: Enable hwcap bits for bfloat16 Richard Henderson
` (2 subsequent siblings)
12 siblings, 0 replies; 14+ messages in thread
From: Richard Henderson @ 2021-05-25 22:58 UTC (permalink / raw)
To: qemu-devel; +Cc: Peter Maydell, qemu-arm
This is BFMLAL{B,T} for both AArch64 AdvSIMD and SVE,
and VFMA{B,T}.BF16 for AArch32 NEON.
Reviewed-by: Peter Maydell <peter.maydell@linaro.org>
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/helper.h | 2 ++
target/arm/neon-shared.decode | 2 ++
target/arm/sve.decode | 2 ++
target/arm/translate-a64.c | 15 ++++++++++++++-
target/arm/translate-neon.c | 10 ++++++++++
target/arm/translate-sve.c | 30 ++++++++++++++++++++++++++++++
target/arm/vec_helper.c | 22 ++++++++++++++++++++++
7 files changed, 82 insertions(+), 1 deletion(-)
diff --git a/target/arm/helper.h b/target/arm/helper.h
index 36b3c9dd2d..dc6eb96d43 100644
--- a/target/arm/helper.h
+++ b/target/arm/helper.h
@@ -1012,6 +1012,8 @@ DEF_HELPER_FLAGS_5(gvec_bfmmla, TCG_CALL_NO_RWG,
DEF_HELPER_FLAGS_6(gvec_bfmlal, TCG_CALL_NO_RWG,
void, ptr, ptr, ptr, ptr, ptr, i32)
+DEF_HELPER_FLAGS_6(gvec_bfmlal_idx, TCG_CALL_NO_RWG,
+ void, ptr, ptr, ptr, ptr, ptr, i32)
#ifdef TARGET_AARCH64
#include "helper-a64.h"
diff --git a/target/arm/neon-shared.decode b/target/arm/neon-shared.decode
index b61addd98b..df80e6ebf6 100644
--- a/target/arm/neon-shared.decode
+++ b/target/arm/neon-shared.decode
@@ -95,3 +95,5 @@ VFML_scalar 1111 1110 0 . 0 s:1 .... .... 1000 . 0 . 1 index:1 ... \
rm=%vfml_scalar_q0_rm vn=%vn_sp vd=%vd_dp q=0
VFML_scalar 1111 1110 0 . 0 s:1 .... .... 1000 . 1 . 1 . rm:3 \
index=%vfml_scalar_q1_index vn=%vn_dp vd=%vd_dp q=1
+VFMA_b16_scal 1111 1110 0.11 .... .... 1000 . q:1 . 1 . vm:3 \
+ index=%vfml_scalar_q1_index vn=%vn_dp vd=%vd_dp
diff --git a/target/arm/sve.decode b/target/arm/sve.decode
index 5281164eea..a62c169f1a 100644
--- a/target/arm/sve.decode
+++ b/target/arm/sve.decode
@@ -1638,6 +1638,8 @@ FMLALB_zzxw 01100100 10 1 ..... 0100.0 ..... ..... @rrxr_3a esz=2
FMLALT_zzxw 01100100 10 1 ..... 0100.1 ..... ..... @rrxr_3a esz=2
FMLSLB_zzxw 01100100 10 1 ..... 0110.0 ..... ..... @rrxr_3a esz=2
FMLSLT_zzxw 01100100 10 1 ..... 0110.1 ..... ..... @rrxr_3a esz=2
+BFMLALB_zzxw 01100100 11 1 ..... 0100.0 ..... ..... @rrxr_3a esz=2
+BFMLALT_zzxw 01100100 11 1 ..... 0100.1 ..... ..... @rrxr_3a esz=2
### SVE2 floating-point bfloat16 dot-product (indexed)
BFDOT_zzxz 01100100 01 1 ..... 010000 ..... ..... @rrxr_2 esz=2
diff --git a/target/arm/translate-a64.c b/target/arm/translate-a64.c
index 2a99e015ca..d1dc9401d5 100644
--- a/target/arm/translate-a64.c
+++ b/target/arm/translate-a64.c
@@ -13465,18 +13465,27 @@ static void disas_simd_indexed(DisasContext *s, uint32_t insn)
unallocated_encoding(s);
return;
}
+ size = MO_32;
break;
case 1: /* BFDOT */
if (is_scalar || !dc_isar_feature(aa64_bf16, s)) {
unallocated_encoding(s);
return;
}
+ size = MO_32;
+ break;
+ case 3: /* BFMLAL{B,T} */
+ if (is_scalar || !dc_isar_feature(aa64_bf16, s)) {
+ unallocated_encoding(s);
+ return;
+ }
+ /* can't set is_fp without other incorrect size checks */
+ size = MO_16;
break;
default:
unallocated_encoding(s);
return;
}
- size = MO_32;
break;
case 0x11: /* FCMLA #0 */
case 0x13: /* FCMLA #90 */
@@ -13606,6 +13615,10 @@ static void disas_simd_indexed(DisasContext *s, uint32_t insn)
gen_gvec_op4_ool(s, is_q, rd, rn, rm, rd, index,
gen_helper_gvec_usdot_idx_b);
return;
+ case 3: /* BFMLAL{B,T} */
+ gen_gvec_op4_fpst(s, 1, rd, rn, rm, rd, 0, (index << 1) | is_q,
+ gen_helper_gvec_bfmlal_idx);
+ return;
}
g_assert_not_reached();
case 0x11: /* FCMLA #0 */
diff --git a/target/arm/translate-neon.c b/target/arm/translate-neon.c
index 4d0c2494dc..633fef3bf7 100644
--- a/target/arm/translate-neon.c
+++ b/target/arm/translate-neon.c
@@ -4144,3 +4144,13 @@ static bool trans_VFMA_b16(DisasContext *s, arg_VFMA_b16 *a)
return do_neon_ddda_fpst(s, 7, a->vd, a->vn, a->vm, a->q, FPST_STD,
gen_helper_gvec_bfmlal);
}
+
+static bool trans_VFMA_b16_scal(DisasContext *s, arg_VFMA_b16_scal *a)
+{
+ if (!dc_isar_feature(aa32_bf16, s)) {
+ return false;
+ }
+ return do_neon_ddda_fpst(s, 6, a->vd, a->vn, a->vm,
+ (a->index << 1) | a->q, FPST_STD,
+ gen_helper_gvec_bfmlal_idx);
+}
diff --git a/target/arm/translate-sve.c b/target/arm/translate-sve.c
index ba8f5d7b7d..46210eb696 100644
--- a/target/arm/translate-sve.c
+++ b/target/arm/translate-sve.c
@@ -8719,3 +8719,33 @@ static bool trans_BFMLALT_zzzw(DisasContext *s, arg_rrrr_esz *a)
{
return do_BFMLAL_zzzw(s, a, true);
}
+
+static bool do_BFMLAL_zzxw(DisasContext *s, arg_rrxr_esz *a, bool sel)
+{
+ if (!dc_isar_feature(aa64_sve_bf16, s)) {
+ return false;
+ }
+ if (sve_access_check(s)) {
+ TCGv_ptr status = fpstatus_ptr(FPST_FPCR);
+ unsigned vsz = vec_full_reg_size(s);
+
+ tcg_gen_gvec_4_ptr(vec_full_reg_offset(s, a->rd),
+ vec_full_reg_offset(s, a->rn),
+ vec_full_reg_offset(s, a->rm),
+ vec_full_reg_offset(s, a->ra),
+ status, vsz, vsz, (a->index << 1) | sel,
+ gen_helper_gvec_bfmlal_idx);
+ tcg_temp_free_ptr(status);
+ }
+ return true;
+}
+
+static bool trans_BFMLALB_zzxw(DisasContext *s, arg_rrxr_esz *a)
+{
+ return do_BFMLAL_zzxw(s, a, false);
+}
+
+static bool trans_BFMLALT_zzxw(DisasContext *s, arg_rrxr_esz *a)
+{
+ return do_BFMLAL_zzxw(s, a, true);
+}
diff --git a/target/arm/vec_helper.c b/target/arm/vec_helper.c
index d82736b5e6..5862f187cd 100644
--- a/target/arm/vec_helper.c
+++ b/target/arm/vec_helper.c
@@ -2528,3 +2528,25 @@ void HELPER(gvec_bfmlal)(void *vd, void *vn, void *vm, void *va,
}
clear_tail(d, opr_sz, simd_maxsz(desc));
}
+
+void HELPER(gvec_bfmlal_idx)(void *vd, void *vn, void *vm,
+ void *va, void *stat, uint32_t desc)
+{
+ intptr_t i, j, opr_sz = simd_oprsz(desc);
+ intptr_t sel = extract32(desc, SIMD_DATA_SHIFT, 1);
+ intptr_t index = extract32(desc, SIMD_DATA_SHIFT + 1, 3);
+ intptr_t elements = opr_sz / 4;
+ intptr_t eltspersegment = MIN(16 / 4, elements);
+ float32 *d = vd, *a = va;
+ bfloat16 *n = vn, *m = vm;
+
+ for (i = 0; i < elements; i += eltspersegment) {
+ float32 m_idx = m[H2(2 * i + index)] << 16;
+
+ for (j = i; j < i + eltspersegment; j++) {
+ float32 n_j = n[H2(2 * j + sel)] << 16;
+ d[H4(j)] = float32_muladd(n_j, m_idx, a[H4(j)], 0, stat);
+ }
+ }
+ clear_tail(d, opr_sz, simd_maxsz(desc));
+}
--
2.25.1
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH v2 11/12] linux-user/aarch64: Enable hwcap bits for bfloat16
2021-05-25 22:58 [PATCH v2 00/12] target/arm: Implement BFloat16 Richard Henderson
` (9 preceding siblings ...)
2021-05-25 22:58 ` [PATCH v2 10/12] target/arm: Implement bfloat widening fma (indexed) Richard Henderson
@ 2021-05-25 22:58 ` Richard Henderson
2021-05-25 22:58 ` [PATCH v2 12/12] target/arm: Enable BFloat16 extensions Richard Henderson
2021-06-03 12:57 ` [PATCH v2 00/12] target/arm: Implement BFloat16 Peter Maydell
12 siblings, 0 replies; 14+ messages in thread
From: Richard Henderson @ 2021-05-25 22:58 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
linux-user/elfload.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/linux-user/elfload.c b/linux-user/elfload.c
index 1ab97e38e0..17ab06f612 100644
--- a/linux-user/elfload.c
+++ b/linux-user/elfload.c
@@ -659,7 +659,9 @@ static uint32_t get_elf_hwcap2(void)
GET_FEATURE_ID(aa64_sve_i8mm, ARM_HWCAP2_A64_SVEI8MM);
GET_FEATURE_ID(aa64_sve_f32mm, ARM_HWCAP2_A64_SVEF32MM);
GET_FEATURE_ID(aa64_sve_f64mm, ARM_HWCAP2_A64_SVEF64MM);
+ GET_FEATURE_ID(aa64_sve_bf16, ARM_HWCAP2_A64_SVEBF16);
GET_FEATURE_ID(aa64_i8mm, ARM_HWCAP2_A64_I8MM);
+ GET_FEATURE_ID(aa64_bf16, ARM_HWCAP2_A64_BF16);
GET_FEATURE_ID(aa64_rndr, ARM_HWCAP2_A64_RNG);
GET_FEATURE_ID(aa64_bti, ARM_HWCAP2_A64_BTI);
GET_FEATURE_ID(aa64_mte, ARM_HWCAP2_A64_MTE);
--
2.25.1
^ permalink raw reply related [flat|nested] 14+ messages in thread
* [PATCH v2 12/12] target/arm: Enable BFloat16 extensions
2021-05-25 22:58 [PATCH v2 00/12] target/arm: Implement BFloat16 Richard Henderson
` (10 preceding siblings ...)
2021-05-25 22:58 ` [PATCH v2 11/12] linux-user/aarch64: Enable hwcap bits for bfloat16 Richard Henderson
@ 2021-05-25 22:58 ` Richard Henderson
2021-06-03 12:57 ` [PATCH v2 00/12] target/arm: Implement BFloat16 Peter Maydell
12 siblings, 0 replies; 14+ messages in thread
From: Richard Henderson @ 2021-05-25 22:58 UTC (permalink / raw)
To: qemu-devel; +Cc: qemu-arm
Disable BF16 again for !have_neon and !have_vfp during realize.
Signed-off-by: Richard Henderson <richard.henderson@linaro.org>
---
target/arm/cpu.c | 3 +++
target/arm/cpu64.c | 3 +++
target/arm/cpu_tcg.c | 1 +
3 files changed, 7 insertions(+)
diff --git a/target/arm/cpu.c b/target/arm/cpu.c
index 7aeb4b1381..cfc03c550b 100644
--- a/target/arm/cpu.c
+++ b/target/arm/cpu.c
@@ -1463,6 +1463,7 @@ static void arm_cpu_realizefn(DeviceState *dev, Error **errp)
u = cpu->isar.id_isar6;
u = FIELD_DP32(u, ID_ISAR6, JSCVT, 0);
+ u = FIELD_DP32(u, ID_ISAR6, BF16, 0);
cpu->isar.id_isar6 = u;
u = cpu->isar.mvfr0;
@@ -1503,6 +1504,7 @@ static void arm_cpu_realizefn(DeviceState *dev, Error **errp)
t = cpu->isar.id_aa64isar1;
t = FIELD_DP64(t, ID_AA64ISAR1, FCMA, 0);
+ t = FIELD_DP64(t, ID_AA64ISAR1, BF16, 0);
t = FIELD_DP64(t, ID_AA64ISAR1, I8MM, 0);
cpu->isar.id_aa64isar1 = t;
@@ -1518,6 +1520,7 @@ static void arm_cpu_realizefn(DeviceState *dev, Error **errp)
u = cpu->isar.id_isar6;
u = FIELD_DP32(u, ID_ISAR6, DP, 0);
u = FIELD_DP32(u, ID_ISAR6, FHM, 0);
+ u = FIELD_DP32(u, ID_ISAR6, BF16, 0);
u = FIELD_DP32(u, ID_ISAR6, I8MM, 0);
cpu->isar.id_isar6 = u;
diff --git a/target/arm/cpu64.c b/target/arm/cpu64.c
index d561dc7acc..1c23187d1a 100644
--- a/target/arm/cpu64.c
+++ b/target/arm/cpu64.c
@@ -661,6 +661,7 @@ static void aarch64_max_initfn(Object *obj)
t = FIELD_DP64(t, ID_AA64ISAR1, FCMA, 1);
t = FIELD_DP64(t, ID_AA64ISAR1, SB, 1);
t = FIELD_DP64(t, ID_AA64ISAR1, SPECRES, 1);
+ t = FIELD_DP64(t, ID_AA64ISAR1, BF16, 1);
t = FIELD_DP64(t, ID_AA64ISAR1, FRINTTS, 1);
t = FIELD_DP64(t, ID_AA64ISAR1, LRCPC, 2); /* ARMv8.4-RCPC */
t = FIELD_DP64(t, ID_AA64ISAR1, I8MM, 1);
@@ -708,6 +709,7 @@ static void aarch64_max_initfn(Object *obj)
t = FIELD_DP64(t, ID_AA64ZFR0, SVEVER, 1);
t = FIELD_DP64(t, ID_AA64ZFR0, AES, 2); /* PMULL */
t = FIELD_DP64(t, ID_AA64ZFR0, BITPERM, 1);
+ t = FIELD_DP64(t, ID_AA64ZFR0, BFLOAT16, 1);
t = FIELD_DP64(t, ID_AA64ZFR0, SHA3, 1);
t = FIELD_DP64(t, ID_AA64ZFR0, SM4, 1);
t = FIELD_DP64(t, ID_AA64ZFR0, I8MM, 1);
@@ -731,6 +733,7 @@ static void aarch64_max_initfn(Object *obj)
u = FIELD_DP32(u, ID_ISAR6, FHM, 1);
u = FIELD_DP32(u, ID_ISAR6, SB, 1);
u = FIELD_DP32(u, ID_ISAR6, SPECRES, 1);
+ u = FIELD_DP32(u, ID_ISAR6, BF16, 1);
u = FIELD_DP32(u, ID_ISAR6, I8MM, 1);
cpu->isar.id_isar6 = u;
diff --git a/target/arm/cpu_tcg.c b/target/arm/cpu_tcg.c
index d3458335ed..c8a12bc2d6 100644
--- a/target/arm/cpu_tcg.c
+++ b/target/arm/cpu_tcg.c
@@ -968,6 +968,7 @@ static void arm_max_initfn(Object *obj)
t = FIELD_DP32(t, ID_ISAR6, FHM, 1);
t = FIELD_DP32(t, ID_ISAR6, SB, 1);
t = FIELD_DP32(t, ID_ISAR6, SPECRES, 1);
+ t = FIELD_DP32(t, ID_ISAR6, BF16, 1);
t = FIELD_DP32(t, ID_ISAR6, I8MM, 1);
cpu->isar.id_isar6 = t;
--
2.25.1
^ permalink raw reply related [flat|nested] 14+ messages in thread
* Re: [PATCH v2 00/12] target/arm: Implement BFloat16
2021-05-25 22:58 [PATCH v2 00/12] target/arm: Implement BFloat16 Richard Henderson
` (11 preceding siblings ...)
2021-05-25 22:58 ` [PATCH v2 12/12] target/arm: Enable BFloat16 extensions Richard Henderson
@ 2021-06-03 12:57 ` Peter Maydell
12 siblings, 0 replies; 14+ messages in thread
From: Peter Maydell @ 2021-06-03 12:57 UTC (permalink / raw)
To: Richard Henderson; +Cc: qemu-arm, QEMU Developers
On Tue, 25 May 2021 at 23:58, Richard Henderson
<richard.henderson@linaro.org> wrote:
>
> Whee! All prerequisites are now merged.
>
> Patches missing r-b:
>
> 05-softfpu-Add-float_round_to_odd_inf.patch
> 06-target-arm-Implement-bfloat16-dot-product-vector.patch
> 07-target-arm-Implement-bfloat16-dot-product-indexed.patch
> 11-linux-user-aarch64-Enable-hwcap-bits-for-bfloat16.patch
> 12-target-arm-Enable-BFloat16-extensions.patch
>
> Per the question of whether additional checks vs VFP or NEON
> are required, I have disabled BF16 when either VFP or NEON are
> also disabled. Which seems like a similarly reasonable choice.
Applied to target-arm.next, thanks.
-- PMM
^ permalink raw reply [flat|nested] 14+ messages in thread
end of thread, other threads:[~2021-06-03 13:03 UTC | newest]
Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-05-25 22:58 [PATCH v2 00/12] target/arm: Implement BFloat16 Richard Henderson
2021-05-25 22:58 ` [PATCH v2 01/12] target/arm: Add isar_feature_{aa32, aa64, aa64_sve}_bf16 Richard Henderson
2021-05-25 22:58 ` [PATCH v2 02/12] target/arm: Unify unallocated path in disas_fp_1src Richard Henderson
2021-05-25 22:58 ` [PATCH v2 03/12] target/arm: Implement scalar float32 to bfloat16 conversion Richard Henderson
2021-05-25 22:58 ` [PATCH v2 04/12] target/arm: Implement vector " Richard Henderson
2021-05-25 22:58 ` [PATCH v2 05/12] softfpu: Add float_round_to_odd_inf Richard Henderson
2021-05-25 22:58 ` [PATCH v2 06/12] target/arm: Implement bfloat16 dot product (vector) Richard Henderson
2021-05-25 22:58 ` [PATCH v2 07/12] target/arm: Implement bfloat16 dot product (indexed) Richard Henderson
2021-05-25 22:58 ` [PATCH v2 08/12] target/arm: Implement bfloat16 matrix multiply accumulate Richard Henderson
2021-05-25 22:58 ` [PATCH v2 09/12] target/arm: Implement bfloat widening fma (vector) Richard Henderson
2021-05-25 22:58 ` [PATCH v2 10/12] target/arm: Implement bfloat widening fma (indexed) Richard Henderson
2021-05-25 22:58 ` [PATCH v2 11/12] linux-user/aarch64: Enable hwcap bits for bfloat16 Richard Henderson
2021-05-25 22:58 ` [PATCH v2 12/12] target/arm: Enable BFloat16 extensions Richard Henderson
2021-06-03 12:57 ` [PATCH v2 00/12] target/arm: Implement BFloat16 Peter Maydell
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).