* [RFC] risc-v vector (RVV) emulation performance issues
@ 2023-07-24 13:40 Daniel Henrique Barboza
  2023-07-24 15:23 ` Philippe Mathieu-Daudé
  2023-07-25 18:53 ` Richard Henderson
  0 siblings, 2 replies; 3+ messages in thread
From: Daniel Henrique Barboza @ 2023-07-24 13:40 UTC (permalink / raw)
  To: qemu-devel, open list:RISC-V
  Cc: Palmer Dabbelt, Jeff Law, Richard Henderson, Alistair Francis

Hi,

As some of you are already aware, the current RVV emulation could be faster.
We have at least one commit (bc0ec52eb2, "target/riscv/vector_helper.c:
skip set tail when vta is zero") that tried to address at least part of the
problem.

Running a simple program like this:

-------

#include <stdlib.h>

#define SZ 10000000

int main ()
{
   int *a = malloc (SZ * sizeof (int));
   int *b = malloc (SZ * sizeof (int));
   int *c = malloc (SZ * sizeof (int));

   for (int i = 0; i < SZ; i++)
     c[i] = a[i] + b[i];
   return c[SZ - 1];
}

-------
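For reference, the two binaries can be produced with a RISC-V cross toolchain along these lines (a sketch; the compiler name and the exact -march strings are assumptions, not taken from the original report):

```shell
# Scalar-only build: no 'v' in the -march string, so no RVV code is emitted.
riscv64-linux-gnu-gcc -O2 -march=rv64gc -mabi=lp64d foo.c -o foo-novect.out

# Vectorized build: adding 'v' lets GCC auto-vectorize the loop with RVV.
riscv64-linux-gnu-gcc -O2 -march=rv64gcv -mabi=lp64d foo.c -o foo.out
```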

Compiling it without RVV support, it runs in 50 milliseconds or so:

$ time ~/work/qemu/build/qemu-riscv64 -cpu rv64,debug=false,vext_spec=v1.0,v=true,vlen=128 ./foo-novect.out

real	0m0.043s
user	0m0.025s
sys	0m0.018s

Building the same program with RVV support slows it down 4-5 times:

$ time ~/work/qemu/build/qemu-riscv64 -cpu rv64,debug=false,vext_spec=v1.0,v=true,vlen=1024 ./foo.out

real	0m0.196s
user	0m0.177s
sys	0m0.018s

Using the lowest 'vlen' value allowed (128) slows things down even further, taking it to
~0.260s.


'perf record' shows the following profile on the aforementioned binary:

   23.27%  qemu-riscv64  qemu-riscv64             [.] do_ld4_mmu
   21.11%  qemu-riscv64  qemu-riscv64             [.] vext_ldst_us
   14.05%  qemu-riscv64  qemu-riscv64             [.] cpu_ldl_le_data_ra
   11.51%  qemu-riscv64  qemu-riscv64             [.] cpu_stl_le_data_ra
    8.18%  qemu-riscv64  qemu-riscv64             [.] cpu_mmu_lookup
    8.04%  qemu-riscv64  qemu-riscv64             [.] do_st4_mmu
    2.04%  qemu-riscv64  qemu-riscv64             [.] ste_w
    1.15%  qemu-riscv64  qemu-riscv64             [.] lde_w
    1.02%  qemu-riscv64  [unknown]                [k] 0xffffffffb3001260
    0.90%  qemu-riscv64  qemu-riscv64             [.] cpu_get_tb_cpu_state
    0.64%  qemu-riscv64  qemu-riscv64             [.] tb_lookup
    0.64%  qemu-riscv64  qemu-riscv64             [.] riscv_cpu_mmu_index
    0.39%  qemu-riscv64  qemu-riscv64             [.] object_dynamic_cast_assert


The first thing that caught my attention is vext_ldst_us from target/riscv/vector_helper.c:

     /* load bytes from guest memory */
     for (i = env->vstart; i < evl; i++, env->vstart++) {
         k = 0;
         while (k < nf) {
             target_ulong addr = base + ((i * nf + k) << log2_esz);
             ldst_elem(env, adjust_addr(env, addr), i + k * max_elems, vd, ra);
             k++;
         }
     }
     env->vstart = 0;

Given that this is a unit-stride load that accesses contiguous elements in memory, it
seems that this loop could be optimized/removed, since it's loading/storing elements
one by one. I didn't find any TCG op to do that, though. I assume that ARM SVE might
have something of the sort. Richard, care to comment?

The current support we have is good enough for booting a kernel and running tests, but
things degrade quickly if one attempts an x264 SPEC run with it. With a SPEC run we see
other insns appearing as hot, but for now it would be good to see if we can optimize
these loads and stores.


Any ideas on how to tackle this? Thanks,


Daniel




* Re: [RFC] risc-v vector (RVV) emulation performance issues
  2023-07-24 13:40 [RFC] risc-v vector (RVV) emulation performance issues Daniel Henrique Barboza
@ 2023-07-24 15:23 ` Philippe Mathieu-Daudé
  2023-07-25 18:53 ` Richard Henderson
  1 sibling, 0 replies; 3+ messages in thread
From: Philippe Mathieu-Daudé @ 2023-07-24 15:23 UTC (permalink / raw)
  To: Daniel Henrique Barboza, qemu-devel, open list:RISC-V
  Cc: Palmer Dabbelt, Jeff Law, Richard Henderson, Alistair Francis

On 24/7/23 15:40, Daniel Henrique Barboza wrote:
> Hi,
> 
> As some of you are already aware, the current RVV emulation could be faster.
> We have at least one commit (bc0ec52eb2, "target/riscv/vector_helper.c:
> skip set tail when vta is zero") that tried to address at least part of the
> problem.


> The first thing that caught my attention is vext_ldst_us from target/riscv/vector_helper.c:
> 
>      /* load bytes from guest memory */
>      for (i = env->vstart; i < evl; i++, env->vstart++) {
>          k = 0;
>          while (k < nf) {
>              target_ulong addr = base + ((i * nf + k) << log2_esz);
>              ldst_elem(env, adjust_addr(env, addr), i + k * max_elems, vd, ra);
>              k++;
>          }
>      }
>      env->vstart = 0;
> 
> Given that this is a unit-stride load that accesses contiguous elements in memory, it
> seems that this loop could be optimized/removed, since it's loading/storing elements
> one by one. I didn't find any TCG op to do that, though. I assume that ARM SVE might
> have something of the sort. Richard, care to comment?

Have you looked at the "tcg/tcg-op-gvec-common.h" API?



* Re: [RFC] risc-v vector (RVV) emulation performance issues
  2023-07-24 13:40 [RFC] risc-v vector (RVV) emulation performance issues Daniel Henrique Barboza
  2023-07-24 15:23 ` Philippe Mathieu-Daudé
@ 2023-07-25 18:53 ` Richard Henderson
  1 sibling, 0 replies; 3+ messages in thread
From: Richard Henderson @ 2023-07-25 18:53 UTC (permalink / raw)
  To: Daniel Henrique Barboza, qemu-devel, open list:RISC-V
  Cc: Palmer Dabbelt, Jeff Law, Alistair Francis

On 7/24/23 06:40, Daniel Henrique Barboza wrote:
> Hi,
> 
> As some of you are already aware, the current RVV emulation could be faster.
> We have at least one commit (bc0ec52eb2, "target/riscv/vector_helper.c:
> skip set tail when vta is zero") that tried to address at least part of the
> problem.
> 
> Running a simple program like this:
> 
> -------
> 
> #include <stdlib.h>
> 
> #define SZ 10000000
> 
> int main ()
> {
>    int *a = malloc (SZ * sizeof (int));
>    int *b = malloc (SZ * sizeof (int));
>    int *c = malloc (SZ * sizeof (int));
> 
>    for (int i = 0; i < SZ; i++)
>      c[i] = a[i] + b[i];
>    return c[SZ - 1];
> }
> 
> -------
> 
> Compiling it without RVV support, it runs in 50 milliseconds or so:
> 
> $ time ~/work/qemu/build/qemu-riscv64 -cpu rv64,debug=false,vext_spec=v1.0,v=true,vlen=128 ./foo-novect.out
> 
> real    0m0.043s
> user    0m0.025s
> sys    0m0.018s
> 
> Building the same program with RVV support slows it down 4-5 times:
> 
> $ time ~/work/qemu/build/qemu-riscv64 -cpu rv64,debug=false,vext_spec=v1.0,v=true,vlen=1024 ./foo.out
> 
> real    0m0.196s
> user    0m0.177s
> sys    0m0.018s
> 
> Using the lowest 'vlen' value allowed (128) slows things down even further, taking it to
> ~0.260s.
> 
> 
> 'perf record' shows the following profile on the aforementioned binary:
> 
>    23.27%  qemu-riscv64  qemu-riscv64             [.] do_ld4_mmu
>    21.11%  qemu-riscv64  qemu-riscv64             [.] vext_ldst_us
>    14.05%  qemu-riscv64  qemu-riscv64             [.] cpu_ldl_le_data_ra
>    11.51%  qemu-riscv64  qemu-riscv64             [.] cpu_stl_le_data_ra
>     8.18%  qemu-riscv64  qemu-riscv64             [.] cpu_mmu_lookup
>     8.04%  qemu-riscv64  qemu-riscv64             [.] do_st4_mmu
>     2.04%  qemu-riscv64  qemu-riscv64             [.] ste_w
>     1.15%  qemu-riscv64  qemu-riscv64             [.] lde_w
>     1.02%  qemu-riscv64  [unknown]                [k] 0xffffffffb3001260
>     0.90%  qemu-riscv64  qemu-riscv64             [.] cpu_get_tb_cpu_state
>     0.64%  qemu-riscv64  qemu-riscv64             [.] tb_lookup
>     0.64%  qemu-riscv64  qemu-riscv64             [.] riscv_cpu_mmu_index
>     0.39%  qemu-riscv64  qemu-riscv64             [.] object_dynamic_cast_assert
> 
> 
> The first thing that caught my attention is vext_ldst_us from target/riscv/vector_helper.c:
> 
>      /* load bytes from guest memory */
>      for (i = env->vstart; i < evl; i++, env->vstart++) {
>          k = 0;
>          while (k < nf) {
>              target_ulong addr = base + ((i * nf + k) << log2_esz);
>              ldst_elem(env, adjust_addr(env, addr), i + k * max_elems, vd, ra);
>              k++;
>          }
>      }
>      env->vstart = 0;
> 
> Given that this is a unit-stride load that accesses contiguous elements in memory, it
> seems that this loop could be optimized/removed, since it's loading/storing elements
> one by one. I didn't find any TCG op to do that, though. I assume that ARM SVE might
> have something of the sort. Richard, care to comment?

Yes, SVE optimizes this case -- see

https://gitlab.com/qemu-project/qemu/-/blob/master/target/arm/tcg/sve_helper.c?ref_type=heads#L5651

It's not possible to do this generically, due to the predication. There's quite a lot of 
machinery that goes into expanding this such that each helper uses the correct host 
load/store insn in the fast case.


r~


