* [RFC] risc-v vector (RVV) emulation performance issues
@ 2023-07-24 13:40 Daniel Henrique Barboza
  2023-07-24 15:23 ` Philippe Mathieu-Daudé
  2023-07-25 18:53 ` Richard Henderson
  0 siblings, 2 replies; 3+ messages in thread
From: Daniel Henrique Barboza @ 2023-07-24 13:40 UTC (permalink / raw)
  To: qemu-devel, open list:RISC-V
  Cc: Palmer Dabbelt, Jeff Law, Richard Henderson, Alistair Francis

Hi,

As some of you are already aware, the current RVV emulation could be faster.
We have at least one commit (bc0ec52eb2, "target/riscv/vector_helper.c:
skip set tail when vta is zero") that tried to address at least part of the
problem.

Running a simple program like this:

-------

#include <stdlib.h>

#define SZ 10000000

int main ()
{
   int *a = malloc (SZ * sizeof (int));
   int *b = malloc (SZ * sizeof (int));
   int *c = malloc (SZ * sizeof (int));

   for (int i = 0; i < SZ; i++)
     c[i] = a[i] + b[i];
   return c[SZ - 1];
}

-------
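For reference, the two binaries can be produced with a RISC-V cross toolchain along these lines (a sketch; the compiler name and the exact -march strings are assumptions, not taken from the original report):

```shell
# Scalar-only build: no 'v' in the -march string, so no RVV code is emitted.
riscv64-linux-gnu-gcc -O2 -march=rv64gc -mabi=lp64d foo.c -o foo-novect.out

# Vectorized build: adding 'v' lets GCC auto-vectorize the loop with RVV.
riscv64-linux-gnu-gcc -O2 -march=rv64gcv -mabi=lp64d foo.c -o foo.out
```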

Compiling it without RVV support, it runs in 50 milliseconds or so:

$ time ~/work/qemu/build/qemu-riscv64 -cpu rv64,debug=false,vext_spec=v1.0,v=true,vlen=128 ./foo-novect.out

real	0m0.043s
user	0m0.025s
sys	0m0.018s

Building the same program with RVV support slows it down 4-5 times:

$ time ~/work/qemu/build/qemu-riscv64 -cpu rv64,debug=false,vext_spec=v1.0,v=true,vlen=1024 ./foo.out

real	0m0.196s
user	0m0.177s
sys	0m0.018s

Using the lowest 'vlen' value allowed (128) slows things down even further, taking it to
~0.260s.


'perf record' shows the following profile on the aforementioned binary:

   23.27%  qemu-riscv64  qemu-riscv64             [.] do_ld4_mmu
   21.11%  qemu-riscv64  qemu-riscv64             [.] vext_ldst_us
   14.05%  qemu-riscv64  qemu-riscv64             [.] cpu_ldl_le_data_ra
   11.51%  qemu-riscv64  qemu-riscv64             [.] cpu_stl_le_data_ra
    8.18%  qemu-riscv64  qemu-riscv64             [.] cpu_mmu_lookup
    8.04%  qemu-riscv64  qemu-riscv64             [.] do_st4_mmu
    2.04%  qemu-riscv64  qemu-riscv64             [.] ste_w
    1.15%  qemu-riscv64  qemu-riscv64             [.] lde_w
    1.02%  qemu-riscv64  [unknown]                [k] 0xffffffffb3001260
    0.90%  qemu-riscv64  qemu-riscv64             [.] cpu_get_tb_cpu_state
    0.64%  qemu-riscv64  qemu-riscv64             [.] tb_lookup
    0.64%  qemu-riscv64  qemu-riscv64             [.] riscv_cpu_mmu_index
    0.39%  qemu-riscv64  qemu-riscv64             [.] object_dynamic_cast_assert


The first thing that caught my attention is vext_ldst_us from target/riscv/vector_helper.c:

     /* load bytes from guest memory */
     for (i = env->vstart; i < evl; i++, env->vstart++) {
         k = 0;
         while (k < nf) {
             target_ulong addr = base + ((i * nf + k) << log2_esz);
             ldst_elem(env, adjust_addr(env, addr), i + k * max_elems, vd, ra);
             k++;
         }
     }
     env->vstart = 0;

Given that this is a unit-stride load that accesses contiguous elements in memory, it
seems that this loop could be optimized/removed, since it's loading/storing elements
one by one. I didn't find any TCG op to do that, though. I assume that ARM SVE might
have something of the sort. Richard, care to comment?

The current support we have is good enough for booting a kernel and running tests, but
things degrade quickly if one attempts an x264 SPEC run with it. With a SPEC run we see
other insns appearing as hot, but for now it would be good to see if we can optimize
these loads and stores.


Any ideas on how to tackle this? Thanks,


Daniel




* Re: [RFC] risc-v vector (RVV) emulation performance issues
  2023-07-24 13:40 [RFC] risc-v vector (RVV) emulation performance issues Daniel Henrique Barboza
@ 2023-07-24 15:23 ` Philippe Mathieu-Daudé
  2023-07-25 18:53 ` Richard Henderson
  1 sibling, 0 replies; 3+ messages in thread
From: Philippe Mathieu-Daudé @ 2023-07-24 15:23 UTC (permalink / raw)
  To: Daniel Henrique Barboza, qemu-devel, open list:RISC-V
  Cc: Palmer Dabbelt, Jeff Law, Richard Henderson, Alistair Francis

On 24/7/23 15:40, Daniel Henrique Barboza wrote:
> Hi,
> 
> As some of you are already aware, the current RVV emulation could be faster.
> We have at least one commit (bc0ec52eb2, "target/riscv/vector_helper.c:
> skip set tail when vta is zero") that tried to address at least part of the
> problem.


> The first thing that caught my attention is vext_ldst_us from target/riscv/vector_helper.c:
> 
>      /* load bytes from guest memory */
>      for (i = env->vstart; i < evl; i++, env->vstart++) {
>          k = 0;
>          while (k < nf) {
>              target_ulong addr = base + ((i * nf + k) << log2_esz);
>              ldst_elem(env, adjust_addr(env, addr), i + k * max_elems, vd, ra);
>              k++;
>          }
>      }
>      env->vstart = 0;
> 
> Given that this is a unit-stride load that accesses contiguous elements in memory, it
> seems that this loop could be optimized/removed, since it's loading/storing elements
> one by one. I didn't find any TCG op to do that, though. I assume that ARM SVE might
> have something of the sort. Richard, care to comment?

Have you looked at the "tcg/tcg-op-gvec-common.h" API?



* Re: [RFC] risc-v vector (RVV) emulation performance issues
  2023-07-24 13:40 [RFC] risc-v vector (RVV) emulation performance issues Daniel Henrique Barboza
  2023-07-24 15:23 ` Philippe Mathieu-Daudé
@ 2023-07-25 18:53 ` Richard Henderson
  1 sibling, 0 replies; 3+ messages in thread
From: Richard Henderson @ 2023-07-25 18:53 UTC (permalink / raw)
  To: Daniel Henrique Barboza, qemu-devel, open list:RISC-V
  Cc: Palmer Dabbelt, Jeff Law, Alistair Francis

On 7/24/23 06:40, Daniel Henrique Barboza wrote:
> Hi,
> 
> As some of you are already aware, the current RVV emulation could be faster.
> We have at least one commit (bc0ec52eb2, "target/riscv/vector_helper.c:
> skip set tail when vta is zero") that tried to address at least part of the
> problem.
> 
> Running a simple program like this:
> 
> -------
> 
> #include <stdlib.h>
> 
> #define SZ 10000000
> 
> int main ()
> {
>    int *a = malloc (SZ * sizeof (int));
>    int *b = malloc (SZ * sizeof (int));
>    int *c = malloc (SZ * sizeof (int));
> 
>    for (int i = 0; i < SZ; i++)
>      c[i] = a[i] + b[i];
>    return c[SZ - 1];
> }
> 
> -------
> 
> Compiling it without RVV support, it runs in 50 milliseconds or so:
> 
> $ time ~/work/qemu/build/qemu-riscv64 -cpu rv64,debug=false,vext_spec=v1.0,v=true,vlen=128 ./foo-novect.out
> 
> real    0m0.043s
> user    0m0.025s
> sys    0m0.018s
> 
> Building the same program with RVV support slows it down 4-5 times:
> 
> $ time ~/work/qemu/build/qemu-riscv64 -cpu rv64,debug=false,vext_spec=v1.0,v=true,vlen=1024 ./foo.out
> 
> real    0m0.196s
> user    0m0.177s
> sys    0m0.018s
> 
> Using the lowest 'vlen' value allowed (128) slows things down even further, taking it to
> ~0.260s.
> 
> 
> 'perf record' shows the following profile on the aforementioned binary:
> 
>    23.27%  qemu-riscv64  qemu-riscv64             [.] do_ld4_mmu
>    21.11%  qemu-riscv64  qemu-riscv64             [.] vext_ldst_us
>    14.05%  qemu-riscv64  qemu-riscv64             [.] cpu_ldl_le_data_ra
>    11.51%  qemu-riscv64  qemu-riscv64             [.] cpu_stl_le_data_ra
>     8.18%  qemu-riscv64  qemu-riscv64             [.] cpu_mmu_lookup
>     8.04%  qemu-riscv64  qemu-riscv64             [.] do_st4_mmu
>     2.04%  qemu-riscv64  qemu-riscv64             [.] ste_w
>     1.15%  qemu-riscv64  qemu-riscv64             [.] lde_w
>     1.02%  qemu-riscv64  [unknown]                [k] 0xffffffffb3001260
>     0.90%  qemu-riscv64  qemu-riscv64             [.] cpu_get_tb_cpu_state
>     0.64%  qemu-riscv64  qemu-riscv64             [.] tb_lookup
>     0.64%  qemu-riscv64  qemu-riscv64             [.] riscv_cpu_mmu_index
>     0.39%  qemu-riscv64  qemu-riscv64             [.] object_dynamic_cast_assert
> 
> 
> The first thing that caught my attention is vext_ldst_us from target/riscv/vector_helper.c:
> 
>      /* load bytes from guest memory */
>      for (i = env->vstart; i < evl; i++, env->vstart++) {
>          k = 0;
>          while (k < nf) {
>              target_ulong addr = base + ((i * nf + k) << log2_esz);
>              ldst_elem(env, adjust_addr(env, addr), i + k * max_elems, vd, ra);
>              k++;
>          }
>      }
>      env->vstart = 0;
> 
> Given that this is a unit-stride load that accesses contiguous elements in memory, it
> seems that this loop could be optimized/removed, since it's loading/storing elements
> one by one. I didn't find any TCG op to do that, though. I assume that ARM SVE might
> have something of the sort. Richard, care to comment?

Yes, SVE optimizes this case -- see

https://gitlab.com/qemu-project/qemu/-/blob/master/target/arm/tcg/sve_helper.c?ref_type=heads#L5651

It's not possible to do this generically, due to the predication. There's quite a lot of 
machinery that goes into expanding this such that each helper uses the correct host 
load/store insn in the fast case.


r~


