* [RFC] risc-v vector (RVV) emulation performance issues
@ 2023-07-24 13:40 Daniel Henrique Barboza
2023-07-24 15:23 ` Philippe Mathieu-Daudé
2023-07-25 18:53 ` Richard Henderson
0 siblings, 2 replies; 3+ messages in thread
From: Daniel Henrique Barboza @ 2023-07-24 13:40 UTC (permalink / raw)
To: qemu-devel, open list:RISC-V
Cc: Palmer Dabbelt, Jeff Law, Richard Henderson, Alistair Francis
Hi,
As some of you are already aware, the current RVV emulation could be faster.
We have at least one commit (bc0ec52eb2, "target/riscv/vector_helper.c:
skip set tail when vta is zero") that tried to address at least part of the
problem.
Running a simple program like this:
-------
#include <stdlib.h>

#define SZ 10000000

int main ()
{
    int *a = malloc (SZ * sizeof (int));
    int *b = malloc (SZ * sizeof (int));
    int *c = malloc (SZ * sizeof (int));

    for (int i = 0; i < SZ; i++)
        c[i] = a[i] + b[i];
    return c[SZ - 1];
}
-------
Compiling it without RVV support, the program runs in 50 milliseconds or so:
$ time ~/work/qemu/build/qemu-riscv64 -cpu rv64,debug=false,vext_spec=v1.0,v=true,vlen=128 ./foo-novect.out
real 0m0.043s
user 0m0.025s
sys 0m0.018s
Building the same program with RVV support slows it down 4-5 times:
$ time ~/work/qemu/build/qemu-riscv64 -cpu rv64,debug=false,vext_spec=v1.0,v=true,vlen=1024 ./foo.out
real 0m0.196s
user 0m0.177s
sys 0m0.018s
Using the lowest 'vlen' value allowed (128) slows things down even further, taking it to
~0.260s.
'perf record' shows the following profile on the aforementioned binary:
23.27% qemu-riscv64 qemu-riscv64 [.] do_ld4_mmu
21.11% qemu-riscv64 qemu-riscv64 [.] vext_ldst_us
14.05% qemu-riscv64 qemu-riscv64 [.] cpu_ldl_le_data_ra
11.51% qemu-riscv64 qemu-riscv64 [.] cpu_stl_le_data_ra
8.18% qemu-riscv64 qemu-riscv64 [.] cpu_mmu_lookup
8.04% qemu-riscv64 qemu-riscv64 [.] do_st4_mmu
2.04% qemu-riscv64 qemu-riscv64 [.] ste_w
1.15% qemu-riscv64 qemu-riscv64 [.] lde_w
1.02% qemu-riscv64 [unknown] [k] 0xffffffffb3001260
0.90% qemu-riscv64 qemu-riscv64 [.] cpu_get_tb_cpu_state
0.64% qemu-riscv64 qemu-riscv64 [.] tb_lookup
0.64% qemu-riscv64 qemu-riscv64 [.] riscv_cpu_mmu_index
0.39% qemu-riscv64 qemu-riscv64 [.] object_dynamic_cast_assert
The first thing that caught my attention is vext_ldst_us from target/riscv/vector_helper.c:
/* load bytes from guest memory */
for (i = env->vstart; i < evl; i++, env->vstart++) {
k = 0;
while (k < nf) {
target_ulong addr = base + ((i * nf + k) << log2_esz);
ldst_elem(env, adjust_addr(env, addr), i + k * max_elems, vd, ra);
k++;
}
}
env->vstart = 0;
Given that this is a unit-stride load that accesses contiguous elements in memory, it
seems that this loop could be optimized/removed, since it's loading/storing bytes
one by one. I didn't find any TCG op to do that though. I assume that ARM SVE might
have something of the sort. Richard, care to comment?
The current support we have is good enough for booting a kernel and running tests, but
things degrade quickly if one attempts to run an x264 SPEC workload with it. In a SPEC
run other insns show up as hot as well, but for now it would be good to see whether we
can optimize these loads and stores.
Any ideas on how to tackle this? Thanks,
Daniel
^ permalink raw reply [flat|nested] 3+ messages in thread
* Re: [RFC] risc-v vector (RVV) emulation performance issues
2023-07-24 13:40 [RFC] risc-v vector (RVV) emulation performance issues Daniel Henrique Barboza
@ 2023-07-24 15:23 ` Philippe Mathieu-Daudé
2023-07-25 18:53 ` Richard Henderson
1 sibling, 0 replies; 3+ messages in thread
From: Philippe Mathieu-Daudé @ 2023-07-24 15:23 UTC (permalink / raw)
To: Daniel Henrique Barboza, qemu-devel, open list:RISC-V
Cc: Palmer Dabbelt, Jeff Law, Richard Henderson, Alistair Francis
On 24/7/23 15:40, Daniel Henrique Barboza wrote:
> Hi,
>
> As some of you are already aware the current RVV emulation could be faster.
> We have at least one commit (bc0ec52eb2, "target/riscv/vector_helper.c:
> skip set tail when vta is zero") that tried to address at least part of the
> problem.
> First thing that caught my attention is vext_ldst_us from
> target/riscv/vector_helper.c:
>
> /* load bytes from guest memory */
> for (i = env->vstart; i < evl; i++, env->vstart++) {
> k = 0;
> while (k < nf) {
> target_ulong addr = base + ((i * nf + k) << log2_esz);
> ldst_elem(env, adjust_addr(env, addr), i + k * max_elems,
> vd, ra);
> k++;
> }
> }
> env->vstart = 0;
>
> Given that this is a unit-stride load that accesses contiguous elements
> in memory, it seems that this loop could be optimized/removed, since
> it's loading/storing bytes one by one. I didn't find any TCG op to do
> that though. I assume that ARM SVE might have something of the sort.
> Richard, care to comment?
Have you looked at the "tcg/tcg-op-gvec-common.h" API?
* Re: [RFC] risc-v vector (RVV) emulation performance issues
2023-07-24 13:40 [RFC] risc-v vector (RVV) emulation performance issues Daniel Henrique Barboza
2023-07-24 15:23 ` Philippe Mathieu-Daudé
@ 2023-07-25 18:53 ` Richard Henderson
1 sibling, 0 replies; 3+ messages in thread
From: Richard Henderson @ 2023-07-25 18:53 UTC (permalink / raw)
To: Daniel Henrique Barboza, qemu-devel, open list:RISC-V
Cc: Palmer Dabbelt, Jeff Law, Alistair Francis
On 7/24/23 06:40, Daniel Henrique Barboza wrote:
> Hi,
>
> As some of you are already aware the current RVV emulation could be faster.
> We have at least one commit (bc0ec52eb2, "target/riscv/vector_helper.c:
> skip set tail when vta is zero") that tried to address at least part of the
> problem.
>
> Running a simple program like this:
>
> -------
>
> #include <stdlib.h>
>
> #define SZ 10000000
>
> int main ()
> {
>     int *a = malloc (SZ * sizeof (int));
>     int *b = malloc (SZ * sizeof (int));
>     int *c = malloc (SZ * sizeof (int));
>
>     for (int i = 0; i < SZ; i++)
>         c[i] = a[i] + b[i];
>     return c[SZ - 1];
> }
>
> -------
>
> Compiling it without RVV support, the program runs in 50 milliseconds or so:
>
> $ time ~/work/qemu/build/qemu-riscv64 -cpu rv64,debug=false,vext_spec=v1.0,v=true,vlen=128
> ./foo-novect.out
>
> real 0m0.043s
> user 0m0.025s
> sys 0m0.018s
>
> Building the same program with RVV support slows it 4-5 times:
>
> $ time ~/work/qemu/build/qemu-riscv64 -cpu
> rv64,debug=false,vext_spec=v1.0,v=true,vlen=1024 ./foo.out
>
> real 0m0.196s
> user 0m0.177s
> sys 0m0.018s
>
> Using the lowest 'vlen' value allowed (128) slows things down even further, taking it to
> ~0.260s.
>
>
> 'perf record' shows the following profile on the aforementioned binary:
>
> 23.27% qemu-riscv64 qemu-riscv64 [.] do_ld4_mmu
> 21.11% qemu-riscv64 qemu-riscv64 [.] vext_ldst_us
> 14.05% qemu-riscv64 qemu-riscv64 [.] cpu_ldl_le_data_ra
> 11.51% qemu-riscv64 qemu-riscv64 [.] cpu_stl_le_data_ra
> 8.18% qemu-riscv64 qemu-riscv64 [.] cpu_mmu_lookup
> 8.04% qemu-riscv64 qemu-riscv64 [.] do_st4_mmu
> 2.04% qemu-riscv64 qemu-riscv64 [.] ste_w
> 1.15% qemu-riscv64 qemu-riscv64 [.] lde_w
> 1.02% qemu-riscv64 [unknown] [k] 0xffffffffb3001260
> 0.90% qemu-riscv64 qemu-riscv64 [.] cpu_get_tb_cpu_state
> 0.64% qemu-riscv64 qemu-riscv64 [.] tb_lookup
> 0.64% qemu-riscv64 qemu-riscv64 [.] riscv_cpu_mmu_index
> 0.39% qemu-riscv64 qemu-riscv64 [.] object_dynamic_cast_assert
>
>
> First thing that caught my attention is vext_ldst_us from target/riscv/vector_helper.c:
>
> /* load bytes from guest memory */
> for (i = env->vstart; i < evl; i++, env->vstart++) {
> k = 0;
> while (k < nf) {
> target_ulong addr = base + ((i * nf + k) << log2_esz);
> ldst_elem(env, adjust_addr(env, addr), i + k * max_elems, vd, ra);
> k++;
> }
> }
> env->vstart = 0;
>
> Given that this is a unit-stride load that accesses contiguous elements in memory, it
> seems that this loop could be optimized/removed, since it's loading/storing bytes
> one by one. I didn't find any TCG op to do that though. I assume that ARM SVE might
> have something of the sort. Richard, care to comment?
Yes, SVE optimizes this case -- see
https://gitlab.com/qemu-project/qemu/-/blob/master/target/arm/tcg/sve_helper.c?ref_type=heads#L5651
It's not possible to do this generically, due to the predication. There's quite a lot of
machinery that goes into expanding this such that each helper uses the correct host
load/store insn in the fast case.
r~