* Question about direct block chaining
@ 2022-04-18 14:54 Taylor Simpson
  2022-04-18 15:37 ` Richard Henderson
  0 siblings, 1 reply; 4+ messages in thread
From: Taylor Simpson @ 2022-04-18 14:54 UTC
  To: qemu-devel; +Cc: Richard Henderson, Philippe Mathieu-Daudé

I've been working on speeding up the Hexagon target by using direct block chaining.  Due to Hexagon's VLIW packet semantics (possibly multiple branches in a packet, not processing change-of-flow until packet commit), we have historically treated all change-of-flow as indirect.

I looked at the documentation here
https://qemu.readthedocs.io/en/latest/devel/tcg.html#direct-block-chaining

I implemented both approaches for inner loops and didn't see a speedup in my benchmark.  So, I have a couple of questions:
1) What are the pros and cons of the two approaches (lookup_and_goto_ptr and goto_tb + exit_tb)?
2) How can I verify that direct block chaining is working properly?
      With -d exec, I see lines like the following with goto_tb + exit_tb but NOT lookup_and_goto_ptr
      Linking TBs 0x7fda44172e00 [0050ac38] index 1 -> 0x7fda44173b40 [0050ac6c]

Thanks,
Taylor




* Re: Question about direct block chaining
  2022-04-18 14:54 Question about direct block chaining Taylor Simpson
@ 2022-04-18 15:37 ` Richard Henderson
  2022-04-19  6:02   ` Taylor Simpson
  0 siblings, 1 reply; 4+ messages in thread
From: Richard Henderson @ 2022-04-18 15:37 UTC
  To: Taylor Simpson, qemu-devel; +Cc: Philippe Mathieu-Daudé

On 4/18/22 07:54, Taylor Simpson wrote:
> I implemented both approaches for inner loops and didn't see speedup in my benchmark.  So, I have a couple of questions
> 1) What are the pros and cons of the two approaches (lookup_and_goto_ptr and goto_tb + exit_tb)?

goto_tb can only be used within a single page (plus other restrictions, see 
translator_use_goto_tb).  In addition, as documented, the change in cpu state must be 
constant, beginning with a direct jump.

lookup_and_goto_ptr can handle any change in cpu state, including indirect jumps.
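
[Editorial sketch] The choice described above can be modelled in a few lines of standalone C. This is not QEMU code: `exit_kind`, `branch`, `same_page` and `choose_exit` are hypothetical names invented for illustration, and only two of the conditions are modelled (a constant destination, and a destination on the same guest page as the start of the block). The real translator_use_goto_tb applies further restrictions that are not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; none of these names exist in QEMU. */
enum exit_kind {
    EXIT_GOTO_TB,         /* goto_tb + exit_tb: direct jump that can be patched */
    EXIT_LOOKUP_AND_GOTO  /* lookup_and_goto_ptr: destination looked up at run time */
};

struct branch {
    bool     dest_is_constant; /* destination known at translation time */
    uint64_t dest;             /* guest destination, meaningful only if constant */
};

/* True when both guest addresses fall within the same page. */
static bool same_page(uint64_t a, uint64_t b, uint64_t page_size)
{
    /* The mask trick below requires a power-of-two page size. */
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    return ((a ^ b) & ~(page_size - 1)) == 0;
}

/* Pick an exit for a branch leaving the block that starts at tb_first_pc. */
static enum exit_kind choose_exit(uint64_t tb_first_pc,
                                  const struct branch *br,
                                  uint64_t page_size)
{
    if (br->dest_is_constant &&
        same_page(tb_first_pc, br->dest, page_size)) {
        return EXIT_GOTO_TB;
    }
    /* Indirect jumps and cross-page targets take the run-time lookup. */
    return EXIT_LOOKUP_AND_GOTO;
}
```

With 4 KiB pages, a constant branch from 0x0050ac6c back to 0x0050ac38 would select EXIT_GOTO_TB in this model, while an indirect branch or a branch to another page would select EXIT_LOOKUP_AND_GOTO.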


> 2) How can I verify that direct block chaining is working properly?
>        With -d exec, I see lines like the following with goto_tb + exit_tb but NOT lookup_and_goto_ptr
>        Linking TBs 0x7fda44172e00 [0050ac38] index 1 -> 0x7fda44173b40 [0050ac6c]

Well, that's one way.  I would have also suggested simply looking at -d op output, for the 
various branchy cases you're considering, to see that all of the exits are as expected.


r~



* RE: Question about direct block chaining
  2022-04-18 15:37 ` Richard Henderson
@ 2022-04-19  6:02   ` Taylor Simpson
  2022-04-19 10:24     ` Alex Bennée
  0 siblings, 1 reply; 4+ messages in thread
From: Taylor Simpson @ 2022-04-19  6:02 UTC
  To: Richard Henderson, qemu-devel; +Cc: Philippe Mathieu-Daudé




Thanks!!

I created a synthetic benchmark with a loop with a very small body and a very high number of iterations.  I can see differences in execution time.

Here are my observations:
- goto_tb + exit_tb gives the fastest execution time because it will patch the native jump address
- lookup_and_goto_ptr is an improvement over tcg_gen_exit_tb(NULL, 0)
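
[Editorial sketch] A toy model of why patching the jump helps a tight loop. It is a simplification under stated assumptions, not QEMU's implementation: `toy_tb`, `toy_stats`, `toy_find` and `toy_run` are hypothetical names, each block has a single constant successor, and "generated code" is just a pointer chase. It counts how often control returns to the main loop with and without linking.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical toy types; they only mimic the idea of linked blocks. */
struct toy_tb {
    uint32_t pc;              /* guest address of the block */
    uint32_t next_pc;         /* constant successor address */
    struct toy_tb *jmp_dest;  /* NULL until the exit has been linked */
};

struct toy_stats {
    unsigned long main_loop_lookups; /* returns to the main loop */
    unsigned long links;             /* exits patched ("Linking TBs") */
    unsigned long blocks_run;        /* blocks executed in total */
};

/* Slow path: find a block by guest PC with a linear scan. */
static struct toy_tb *toy_find(struct toy_tb *tbs, size_t n, uint32_t pc)
{
    for (size_t i = 0; i < n; i++) {
        if (tbs[i].pc == pc) {
            return &tbs[i];
        }
    }
    return NULL;
}

/* Run `steps` blocks from start_pc; link exits only when chaining != 0. */
static struct toy_stats toy_run(struct toy_tb *tbs, size_t n,
                                uint32_t start_pc, unsigned long steps,
                                int chaining)
{
    struct toy_stats st = { 0, 0, 0 };
    struct toy_tb *prev = NULL;
    uint32_t pc = start_pc;

    assert(tbs != NULL);
    while (st.blocks_run < steps) {
        /* Back in the main loop: look the next block up by guest PC. */
        struct toy_tb *tb = toy_find(tbs, n, pc);
        st.main_loop_lookups++;
        if (tb == NULL) {
            break;
        }
        if (chaining && prev != NULL && prev->jmp_dest == NULL) {
            prev->jmp_dest = tb;  /* patch the previous block's exit */
            st.links++;
        }
        /* "Generated code": follow patched exits without leaving. */
        for (;;) {
            st.blocks_run++;
            prev = tb;
            pc = tb->next_pc;
            if (tb->jmp_dest == NULL || st.blocks_run >= steps) {
                break;
            }
            tb = tb->jmp_dest;
        }
    }
    return st;
}
```

For a single block that branches to itself and runs 1000 times, this model returns to the main loop twice when linking is enabled (once to start, once to patch the exit) and 1000 times when it is not. lookup_and_goto_ptr is not modelled here.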

Taylor



* Re: Question about direct block chaining
  2022-04-19  6:02   ` Taylor Simpson
@ 2022-04-19 10:24     ` Alex Bennée
  0 siblings, 0 replies; 4+ messages in thread
From: Alex Bennée @ 2022-04-19 10:24 UTC
  To: Taylor Simpson; +Cc: Richard Henderson, qemu-devel, Philippe Mathieu-Daudé


Taylor Simpson <tsimpson@quicinc.com> writes:

>
> Thanks!!
>
> I created a synthetic benchmark with a loop with a very small body and a very high number of iterations.  I can see differences in execution time.
>
> Here are my observations:
> - goto_tb + exit_tb gives the fastest execution time because it will
> patch the native jump address

As we would expect.

> - lookup_and_goto_ptr is an improvement over tcg_gen_exit_tb(NULL, 0)

Yes - mainly by saving the cost of the prologue and of coming out of
generated code to the main loop. However, once we get to tb_lookup and
miss in the tb_jump_cache, it's going to take some time to get a block
via QHT.

The tb_jump_cache is pretty simple in its implementation but I don't
know if we've ever decently characterised the hit rate and if it could
be improved. I think we already have slightly different hashing
functions for user-mode vs softmmu.

(As an aside, I suspect the trace_vcpu_dstate check can now be removed,
which should save a bit of time on the hash function.)
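
[Editorial sketch] The two-level lookup described above (a small per-CPU cache in front of a slower global table) can be illustrated with a toy direct-mapped cache that counts hits and misses. Everything here is an assumption made for illustration: `toy_jump_cache`, `toy_hash`, `toy_slow_lookup` and `toy_lookup` are hypothetical names, the 16-entry size and the hash are arbitrary, and the linear scan merely stands in for the QHT lookup. A counter like this is one possible way to characterise the hit rate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sizes and names; not QEMU's data structures or hash. */
enum { TOY_CACHE_BITS = 4, TOY_CACHE_SIZE = 1 << TOY_CACHE_BITS };

struct toy_block {
    uint32_t pc; /* guest address the block was translated from */
};

struct toy_jump_cache {
    const struct toy_block *slot[TOY_CACHE_SIZE];
    unsigned long hits;
    unsigned long misses;
};

/* Arbitrary hash: drop the two low bits, keep TOY_CACHE_BITS bits. */
static unsigned toy_hash(uint32_t pc)
{
    return (pc >> 2) & (TOY_CACHE_SIZE - 1);
}

/* Stand-in for the slower global lookup: a plain linear scan. */
static const struct toy_block *toy_slow_lookup(const struct toy_block *all,
                                               size_t n, uint32_t pc)
{
    for (size_t i = 0; i < n; i++) {
        if (all[i].pc == pc) {
            return &all[i];
        }
    }
    return NULL;
}

/* Try the direct-mapped cache first; on a miss, fall back and refill. */
static const struct toy_block *toy_lookup(struct toy_jump_cache *c,
                                          const struct toy_block *all,
                                          size_t n, uint32_t pc)
{
    unsigned idx = toy_hash(pc);
    const struct toy_block *b;

    assert(c != NULL);
    b = c->slot[idx];
    if (b != NULL && b->pc == pc) {
        c->hits++;
        return b;
    }
    c->misses++;
    b = toy_slow_lookup(all, n, pc);
    if (b != NULL) {
        c->slot[idx] = b; /* may evict a block that hashed to this slot */
    }
    return b;
}
```

In this model two blocks whose addresses hash to the same slot (for example 0x100 and 0x140 with the hash above) evict each other on every alternation, so each lookup takes the slow path.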

-- 
Alex Bennée


