* Question about direct block chaining
From: Taylor Simpson @ 2022-04-18 14:54 UTC (permalink / raw)
To: qemu-devel; +Cc: Richard Henderson, Philippe Mathieu-Daudé
I've been working on speeding up the Hexagon target by using direct block chaining. Due to Hexagon's VLIW packet semantics (possibly multiple branches in a packet, not processing change-of-flow until packet commit), we have historically treated all change-of-flow as indirect.
I looked at the documentation here:
https://qemu.readthedocs.io/en/latest/devel/tcg.html#direct-block-chaining
I implemented both approaches for inner loops and did not see a speedup in my benchmark, so I have a couple of questions:
1) What are the pros and cons of the two approaches (lookup_and_goto_ptr and goto_tb + exit_tb)?
2) How can I verify that direct block chaining is working properly?
With -d exec, I see lines like the following with goto_tb + exit_tb but NOT with lookup_and_goto_ptr:
Linking TBs 0x7fda44172e00 [0050ac38] index 1 -> 0x7fda44173b40 [0050ac6c]
Thanks,
Taylor
* Re: Question about direct block chaining
From: Richard Henderson @ 2022-04-18 15:37 UTC (permalink / raw)
To: Taylor Simpson, qemu-devel; +Cc: Philippe Mathieu-Daudé
On 4/18/22 07:54, Taylor Simpson wrote:
> I implemented both approaches for inner loops and didn't see speedup in my benchmark. So, I have a couple of questions
> 1) What are the pros and cons of the two approaches (lookup_and_goto_ptr and goto_tb + exit_tb)?
goto_tb can only be used within a single page (plus other restrictions, see
translator_use_goto_tb). In addition, as documented, the change in cpu state must be
constant, beginning with a direct jump.
lookup_and_goto_ptr can handle any change in cpu state, including indirect jumps.
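To make the same-page rule concrete, here is a minimal standalone C sketch of the choice between the two exits. It is a toy model, not QEMU code: model_same_page, model_pick_exit, and the page_bits parameter are hypothetical names, and it covers only the same-page and constant-destination conditions mentioned above, not the other restrictions in translator_use_goto_tb.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The two exit styles discussed in this thread (names are illustrative). */
enum model_exit {
    MODEL_EXIT_GOTO_TB,             /* goto_tb + exit_tb: can be chained */
    MODEL_EXIT_LOOKUP_AND_GOTO_PTR  /* lookup_and_goto_ptr: looked up at run time */
};

/* True when dest lies on the same guest page as the start of the block.
 * page_bits is an assumption of this model (log2 of the guest page size). */
static bool model_same_page(uint64_t pc_first, uint64_t dest, unsigned page_bits)
{
    assert(page_bits < 64);
    const uint64_t page_mask = ~((UINT64_C(1) << page_bits) - 1);
    return ((pc_first ^ dest) & page_mask) == 0;
}

/* Pick goto_tb only for a constant destination on the same page;
 * everything else falls back to lookup_and_goto_ptr. */
static enum model_exit model_pick_exit(uint64_t pc_first, uint64_t dest,
                                       bool dest_is_constant,
                                       unsigned page_bits)
{
    if (dest_is_constant && model_same_page(pc_first, dest, page_bits)) {
        return MODEL_EXIT_GOTO_TB;
    }
    return MODEL_EXIT_LOOKUP_AND_GOTO_PTR;
}
```

Assuming 4 KiB pages (page_bits = 12), the two addresses in the "Linking TBs" line from the original question, 0x0050ac38 and 0x0050ac6c, fall on the same page, so this model picks the goto_tb exit for that branch.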
> 2) How can I verify that direct block chaining is working properly?
> With -d exec, I see lines like the following with goto_tb + exit_tb but NOT lookup_and_goto_ptr
> Linking TBs 0x7fda44172e00 [0050ac38] index 1 -> 0x7fda44173b40 [0050ac6c]
Well, that's one way. I would have also suggested simply looking at -d op output, for the
various branchy cases you're considering, to see that all of the exits are as expected.
r~
* RE: Question about direct block chaining
From: Taylor Simpson @ 2022-04-19 6:02 UTC (permalink / raw)
To: Richard Henderson, qemu-devel; +Cc: Philippe Mathieu-Daudé
> -----Original Message-----
> From: Richard Henderson <richard.henderson@linaro.org>
> Sent: Monday, April 18, 2022 10:38 AM
> To: Taylor Simpson <tsimpson@quicinc.com>; qemu-devel@nongnu.org
> Cc: Philippe Mathieu-Daudé <f4bug@amsat.org>
> Subject: Re: Question about direct block chaining
>
> On 4/18/22 07:54, Taylor Simpson wrote:
> > I implemented both approaches for inner loops and didn't see speedup
> > in my benchmark. So, I have a couple of questions
> > 1) What are the pros and cons of the two approaches
> (lookup_and_goto_ptr and goto_tb + exit_tb)?
>
> goto_tb can only be used within a single page (plus other restrictions, see
> translator_use_goto_tb). In addition, as documented, the change in cpu
> state must be constant, beginning with a direct jump.
>
> lookup_and_goto_ptr can handle any change in cpu state, including indirect
> jumps.
>
>
> > 2) How can I verify that direct block chaining is working properly?
> > With -d exec, I see lines like the following with goto_tb + exit_tb but
> NOT lookup_and_goto_ptr
> > Linking TBs 0x7fda44172e00 [0050ac38] index 1 -> 0x7fda44173b40
> > [0050ac6c]
>
> Well, that's one way. I would have also suggested simply looking at -d op
> output, for the various branchy cases you're considering, to see that all of the
> exits are as expected.
Thanks!!
I created a synthetic benchmark with a loop with a very small body and a very high number of iterations. I can see differences in execution time.
Here are my observations:
- goto_tb + exit_tb gives the fastest execution time because it will patch the native jump address
- lookup_and_goto_ptr is an improvement over tcg_gen_exit_tb(NULL, 0)
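The first observation can be illustrated with a small standalone C model of a patched exit. This is only a sketch under stated assumptions: chain_tb, chain_run, and the two-slot layout are hypothetical names for illustration, and the "patch" is a plain pointer store rather than a rewritten native jump.

```c
#include <assert.h>
#include <stddef.h>

/* Toy translated block with two exits (index 0 and 1). */
struct chain_tb {
    struct chain_tb *chained[2]; /* patched exit slots, NULL until linked */
    struct chain_tb *target[2];  /* where each exit leads in the guest program */
};

/* Execute 'steps' blocks starting at 'start', always leaving through exit
 * 'idx'. Returns how many times control had to go back to the dispatcher
 * because the exit was not linked yet. */
static unsigned chain_run(struct chain_tb *start, int idx, unsigned steps)
{
    assert(idx == 0 || idx == 1);
    unsigned dispatcher_trips = 0;
    struct chain_tb *tb = start;

    for (unsigned i = 0; i < steps; i++) {
        struct chain_tb *next = tb->chained[idx];
        if (next == NULL) {
            /* Unlinked exit: the dispatcher finds the successor and
             * patches the slot so later executions jump directly. */
            dispatcher_trips++;
            next = tb->target[idx];
            assert(next != NULL);
            tb->chained[idx] = next;
        }
        tb = next;
    }
    return dispatcher_trips;
}
```

For a block that loops back to itself through exit 1, running the model for many iterations takes a single trip through the dispatcher; every later iteration follows the patched slot. That matches the shape of the synthetic benchmark: a tiny loop body with a very high iteration count.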
Taylor
* Re: Question about direct block chaining
From: Alex Bennée @ 2022-04-19 10:24 UTC (permalink / raw)
To: Taylor Simpson; +Cc: Richard Henderson, qemu-devel, Philippe Mathieu-Daudé
Taylor Simpson <tsimpson@quicinc.com> writes:
>> -----Original Message-----
>> From: Richard Henderson <richard.henderson@linaro.org>
>> Sent: Monday, April 18, 2022 10:38 AM
>> To: Taylor Simpson <tsimpson@quicinc.com>; qemu-devel@nongnu.org
>> Cc: Philippe Mathieu-Daudé <f4bug@amsat.org>
>> Subject: Re: Question about direct block chaining
>>
>> On 4/18/22 07:54, Taylor Simpson wrote:
>> > I implemented both approaches for inner loops and didn't see speedup
>> > in my benchmark. So, I have a couple of questions
>> > 1) What are the pros and cons of the two approaches
>> (lookup_and_goto_ptr and goto_tb + exit_tb)?
>>
>> goto_tb can only be used within a single page (plus other restrictions, see
>> translator_use_goto_tb). In addition, as documented, the change in cpu
>> state must be constant, beginning with a direct jump.
>>
>> lookup_and_goto_ptr can handle any change in cpu state, including indirect
>> jumps.
>>
>>
>> > 2) How can I verify that direct block chaining is working properly?
>> > With -d exec, I see lines like the following with goto_tb + exit_tb but
>> NOT lookup_and_goto_ptr
>> > Linking TBs 0x7fda44172e00 [0050ac38] index 1 -> 0x7fda44173b40
>> > [0050ac6c]
>>
>> Well, that's one way. I would have also suggested simply looking at -d op
>> output, for the various branchy cases you're considering, to see that all of the
>> exits are as expected.
>
> Thanks!!
>
> I created a synthetic benchmark with a loop with a very small body and a very high number of iterations. I can see differences in execution time.
>
> Here are my observations:
> - goto_tb + exit_tb gives the fastest execution time because it will
> patch the native jump address
As we would expect.
> - lookup_and_goto_ptr is an improvement over tcg_gen_exit_tb(NULL, 0)
Yes - mainly saving the cost of the prologue and of coming out of generated
code to the main loop. However, once we get to tb_lookup and miss in the
tb_jump_cache, it's going to take some time to get a block via QHT.
The tb_jump_cache is pretty simple in its implementation but I don't
know if we've ever decently characterised the hit rate and if it could
be improved. I think we already have slightly different hashing
functions for user-mode vs softmmu.
(As an aside, I suspect the trace_vcpu_dstate check can now be removed, which
should save a bit of time on the hash function.)
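As a rough illustration of the fast-path and slow-path split described above, here is a standalone C sketch of a direct-mapped jump cache backed by a slower lookup. Everything in it is an assumption for illustration: cache_tb, cache_cpu, cache_hash, and cache_lookup are hypothetical names, the hash is not QEMU's, and a linear scan stands in for the QHT lookup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_BITS 4
#define CACHE_SIZE (1u << CACHE_BITS)

/* Toy translated block, identified only by its guest pc. */
struct cache_tb {
    uint64_t pc;
};

struct cache_cpu {
    const struct cache_tb *jump_cache[CACHE_SIZE]; /* direct-mapped fast path */
    const struct cache_tb *all_tbs;                /* slow path: every block */
    size_t n_tbs;
    unsigned fast_hits;
    unsigned slow_lookups;
};

/* Hypothetical hash: drop the two low pc bits, keep CACHE_BITS bits. */
static unsigned cache_hash(uint64_t pc)
{
    return (unsigned)((pc >> 2) & (CACHE_SIZE - 1));
}

/* Look up the block for pc: try the jump cache first, then fall back to
 * the slow search and refill the cache slot on success. Returns NULL when
 * no block exists (a real emulator would translate one at that point). */
static const struct cache_tb *cache_lookup(struct cache_cpu *cpu, uint64_t pc)
{
    assert(cpu != NULL);
    unsigned slot = cache_hash(pc);
    const struct cache_tb *tb = cpu->jump_cache[slot];

    if (tb != NULL && tb->pc == pc) {
        cpu->fast_hits++;
        return tb;
    }

    cpu->slow_lookups++;
    for (size_t i = 0; i < cpu->n_tbs; i++) {
        if (cpu->all_tbs[i].pc == pc) {
            cpu->jump_cache[slot] = &cpu->all_tbs[i];
            return &cpu->all_tbs[i];
        }
    }
    return NULL;
}
```

Counting fast_hits against slow_lookups over a workload is one way such a model could be used to get a feel for hit rate, although the numbers would say nothing about the real tb_jump_cache without using its actual hash and size.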
--
Alex Bennée