All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] Possible ppc comparision optimisation
@ 2013-05-07 22:56 Torbjorn Granlund
  2013-05-08  8:05 ` Paolo Bonzini
  0 siblings, 1 reply; 4+ messages in thread
From: Torbjorn Granlund @ 2013-05-07 22:56 UTC (permalink / raw)
  To: qemu-devel

The current ppc gen_op_cmp generates a long sequence of instructions,
using a plain series of three disjoint compares.

It is possible to compute the 3 result bits more cleverly.  Below is a
possible replacement gen_op_cmp.  (It is tested by booting GNU/Linux
ppx64, but not much more than that.)

Surely this should be faster than the old code?  OK, it is less
readable, but cmp is pretty critical and should be made fast.

Should one truncate things using tcg_gen_trunc_tl_i32 and do the add,
xori, addi as i32 variants?  (Why?)

There could be a disadvantage of this compared to the old code, since
this has a chained algebraic dependency, while the old code's many
instructions might have been more independent.

static inline void gen_op_cmp(TCGv arg0, TCGv arg1, int s, int crf)
{
    TCGv t0 = tcg_temp_new();
    TCGv t1 = tcg_temp_new();
    TCGv_i32 s0 = tcg_temp_new_i32();

    tcg_gen_trunc_tl_i32(cpu_crf[crf], cpu_so);

    tcg_gen_setcond_tl((s ? TCG_COND_LE: TCG_COND_LEU), t0, arg0, arg1);
    tcg_gen_setcond_tl((s ? TCG_COND_LT: TCG_COND_LTU), t1, arg0, arg1);
    tcg_gen_add_tl(t0, t0, t1);
    tcg_gen_xori_tl(t0, t0, 1);
    tcg_gen_addi_tl(t0, t0, 1);
    tcg_gen_trunc_tl_i32(s0, t0);
    tcg_gen_shli_i32(s0, s0, 1);
    tcg_gen_or_i32(cpu_crf[crf], cpu_crf[crf], s0);

    tcg_temp_free(t0);
    tcg_temp_free(t1);
    tcg_temp_free_i32(s0);
}

-- 
Torbjörn

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [Qemu-devel] Possible ppc comparision optimisation
  2013-05-07 22:56 [Qemu-devel] Possible ppc comparision optimisation Torbjorn Granlund
@ 2013-05-08  8:05 ` Paolo Bonzini
  2013-05-08 15:44   ` Torbjorn Granlund
  0 siblings, 1 reply; 4+ messages in thread
From: Paolo Bonzini @ 2013-05-08  8:05 UTC (permalink / raw)
  To: Torbjorn Granlund; +Cc: qemu-devel

Il 08/05/2013 00:56, Torbjorn Granlund ha scritto:
> The current ppc gen_op_cmp generates a long sequence of instructions,
> using a plain series of three disjoint compares.
> 
> It is possible to compute the 3 result bits more cleverly.  Below is a
> possible replacement gen_op_cmp.  (It is tested by booting GNU/Linux
> ppx64, but not much more than that.)
> 
> Surely this should be faster than the old code?  OK, it is less
> readable, but cmp is pretty critical and should be made fast.
> 
> Should one truncate things using tcg_gen_trunc_tl_i32 and do the add,
> xori, addi as i32 variants?  (Why?)

I think that would be faster on 32-bit hosts, truncs are cheap.

> There could be a disadvantage of this compared to the old code, since
> this has a chained algebraic dependency, while the old code's many
> instructions might have been more independent.

What about these alternatives:

setcond LT, t0, arg0, arg1
setcond EQ, t1, arg0, arg1
trunc  s0, t0
trunc  s1, t1
shli   s0, s0, 1                ; s0 = (arg0 < arg1) ? 2 : 0
subi   s1, s1, 2                ; s1 = (arg0 != arg1) ? -2 : -1
sub    s0, s0, s1               ; < 4       == 1      > 2
shli   s0, s0, 1                ; < 8       == 2      > 4

=======

setcond LT, t0, arg0, arg1
setcond NE, t1, arg0, arg1
trunc   s0, t0
trunc   s1, t1
add     s0, s0, s1              ; < 2       == 0      > 1
movi    s1, 1
add     s0, s0, s1              ; < 3       == 1      > 2
shl     s1, s1, s0              ; < 8       == 2      > 4

Paolo

> static inline void gen_op_cmp(TCGv arg0, TCGv arg1, int s, int crf)
> {
>     TCGv t0 = tcg_temp_new();
>     TCGv t1 = tcg_temp_new();
>     TCGv_i32 s0 = tcg_temp_new_i32();
> 
>     tcg_gen_trunc_tl_i32(cpu_crf[crf], cpu_so);
> 
>     tcg_gen_setcond_tl((s ? TCG_COND_LE: TCG_COND_LEU), t0, arg0, arg1);
>     tcg_gen_setcond_tl((s ? TCG_COND_LT: TCG_COND_LTU), t1, arg0, arg1);
>     tcg_gen_add_tl(t0, t0, t1);
>     tcg_gen_xori_tl(t0, t0, 1);
>     tcg_gen_addi_tl(t0, t0, 1);
>     tcg_gen_trunc_tl_i32(s0, t0);
>     tcg_gen_shli_i32(s0, s0, 1);
>     tcg_gen_or_i32(cpu_crf[crf], cpu_crf[crf], s0);
> 
>     tcg_temp_free(t0);
>     tcg_temp_free(t1);
>     tcg_temp_free_i32(s0);
> }
> 

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [Qemu-devel] Possible ppc comparision optimisation
  2013-05-08  8:05 ` Paolo Bonzini
@ 2013-05-08 15:44   ` Torbjorn Granlund
  2013-05-08 16:16     ` Paolo Bonzini
  0 siblings, 1 reply; 4+ messages in thread
From: Torbjorn Granlund @ 2013-05-08 15:44 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: qemu-devel

Paolo Bonzini <pbonzini@redhat.com> writes:

  I think that would be faster on 32-bit hosts, truncs are cheap.
  
And slower perhaps on 64-bit hosts, at least for operations where
additional explicit trunctation will be needed (such as before
comparisions and after right shifts).

  > There could be a disadvantage of this compared to the old code, since
  > this has a chained algebraic dependency, while the old code's many
  > instructions might have been more independent.
  
  What about these alternatives:
  
  setcond LT, t0, arg0, arg1
  setcond EQ, t1, arg0, arg1
  trunc  s0, t0
  trunc  s1, t1
  shli   s0, s0, 1                ; s0 = (arg0 < arg1) ? 2 : 0
  subi   s1, s1, 2                ; s1 = (arg0 != arg1) ? -2 : -1
  sub    s0, s0, s1               ; < 4       == 1      > 2
  shli   s0, s0, 1                ; < 8       == 2      > 4
  
  =======
  
  setcond LT, t0, arg0, arg1
  setcond NE, t1, arg0, arg1
  trunc   s0, t0
  trunc   s1, t1
  add     s0, s0, s1              ; < 2       == 0      > 1
  movi    s1, 1
  add     s0, s0, s1              ; < 3       == 1      > 2
  shl     s1, s1, s0              ; < 8       == 2      > 4
  
Surely there are many alternative forms.
Is your aim to add micro-parallelism?

(Your sequences look a bit curious.  Did you use a super-optimiser?)

-- 
Torbjörn

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: [Qemu-devel] Possible ppc comparision optimisation
  2013-05-08 15:44   ` Torbjorn Granlund
@ 2013-05-08 16:16     ` Paolo Bonzini
  0 siblings, 0 replies; 4+ messages in thread
From: Paolo Bonzini @ 2013-05-08 16:16 UTC (permalink / raw)
  To: Torbjorn Granlund; +Cc: qemu-devel

Il 08/05/2013 17:44, Torbjorn Granlund ha scritto:
> Paolo Bonzini <pbonzini@redhat.com> writes:
> 
>   I think that would be faster on 32-bit hosts, truncs are cheap.
>   
> And slower perhaps on 64-bit hosts, at least for operations where
> additional explicit trunctation will be needed (such as before
> comparisions and after right shifts).
> 
>   > There could be a disadvantage of this compared to the old code, since
>   > this has a chained algebraic dependency, while the old code's many
>   > instructions might have been more independent.
>   
>   What about these alternatives:
>   
>   setcond LT, t0, arg0, arg1
>   setcond EQ, t1, arg0, arg1
>   trunc  s0, t0
>   trunc  s1, t1
>   shli   s0, s0, 1                ; s0 = (arg0 < arg1) ? 2 : 0
>   subi   s1, s1, 2                ; s1 = (arg0 != arg1) ? -2 : -1
>   sub    s0, s0, s1               ; < 4       == 1      > 2
>   shli   s0, s0, 1                ; < 8       == 2      > 4
>   
>   =======
>   
>   setcond LT, t0, arg0, arg1
>   setcond NE, t1, arg0, arg1
>   trunc   s0, t0
>   trunc   s1, t1
>   add     s0, s0, s1              ; < 2       == 0      > 1
>   movi    s1, 1
>   add     s0, s0, s1              ; < 3       == 1      > 2
>   shl     s1, s1, s0              ; < 8       == 2      > 4
>   
> Surely there are many alternative forms.
> Is your aim to add micro-parallelism?

Yes, I think in this respect I think the first one is better.  The
second could be three instructions on machines that have a set-nth-bit
instruction _and_ a zero register, but I'm not sure they exist...

> (Your sequences look a bit curious.  Did you use a super-optimiser?)

No, but I am attracted to these curious sequences from my previous life
working on compilers. :)  I know your superoptimizer and, in fact, we
both worked on some parts of GCC (optimization of conditional
branches/stores), just 20 years apart.

The second is actually not too curious after you look at it for a while,
it is a variant of the usual (x > y) + (x >= y) trick used to generate a
0/1/2 result.  The first I found by trial and error based on yours; it
is basically (x < y) * 2 - (x == y) + 2, with some reordering to get
parallelism and avoid the need for subfi-like instructions.

Paolo

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2013-05-08 16:17 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2013-05-07 22:56 [Qemu-devel] Possible ppc comparision optimisation Torbjorn Granlund
2013-05-08  8:05 ` Paolo Bonzini
2013-05-08 15:44   ` Torbjorn Granlund
2013-05-08 16:16     ` Paolo Bonzini

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.