* [Qemu-devel] GSoC 2017 Proposal: TCG performance enhancements
@ 2017-03-25 16:52 Pranith Kumar
  2017-03-27 10:57 ` Richard Henderson
                   ` (3 more replies)
  0 siblings, 4 replies; 23+ messages in thread
From: Pranith Kumar @ 2017-03-25 16:52 UTC (permalink / raw)
  To: Richard Henderson, Peter Maydell, Paolo Bonzini, Emilio G. Cota
  Cc: Alex Bennée, qemu-devel

Hello,

With MTTCG code now merged in mainline, I tried to see if we are able to run
x86 SMP guests on ARM64 hosts. For this I tried running a Windows XP guest on
a DragonBoard 410c, which has 1GB of RAM. Since x86 has a strong memory model
whereas ARM64 has a weak one, I added a patch to generate fence instructions
for every guest memory access. After some minor fixes, I was able to boot a
4-core guest all the way to the desktop (albeit with a 1GB backing swap).
However, the performance is severely limited and the guest is barely usable.
Based on my observations, I think there are some easily implementable
additions we can make to improve the performance of TCG in general and on
ARM64 in particular. I propose to do the following as part of Google Summer
of Code 2017.


* Implement jump-to-register instruction on ARM64 to overcome the 128MB
  translation cache size limit.

  The translation cache size for an ARM64 host is currently limited to 128
  MB. This limitation is imposed by using a branch instruction which encodes
  the jump offset and is therefore limited by the number of bits available
  for the range of the offset. The performance impact of this limitation is
  severe and can be observed when you try to run large programs, like a
  browser, in the guest. The cache is flushed several times before the
  browser starts and the performance is not satisfactory. This limitation
  can be overcome by generating a branch-to-register instruction and using
  it when the destination address is outside the range that can be encoded
  in the current branch instruction.

* Implement an LRU translation block code cache.

  In the current TCG design, when the translation cache fills up, we flush all
  the translated blocks (TBs) to free up space. We can improve this situation
  by not flushing the TBs that were recently used, i.e., by implementing an
  LRU policy for freeing the blocks. This should avoid the re-translation
  overhead for frequently used blocks and improve performance.

* Avoid consistency overhead for strong memory model guests by generating
  load-acquire and store-release instructions.

  To run a strongly ordered guest on a weakly ordered host using MTTCG, for
  example, x86 on ARM64, we have to generate fence instructions for all the
  guest memory accesses to ensure consistency. The overhead imposed by these
  fence instructions is significant (almost 3x when compared to a run without
  fence instructions). ARM64 provides load-acquire and store-release
  instructions which have sequentially consistent semantics and can be used
  instead of explicit fences. I plan to add support for generating these
  instructions in the TCG run-time to reduce the consistency overhead in
  MTTCG (see the sketch below).
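
  A compiler-level sketch of the two host-side sequences (illustrative
  function names, not actual TCG output; the exact barrier placement
  required for full x86-on-ARM64 ordering is a separate question):

      #include <stdint.h>

      /* Plain store followed by an explicit full barrier. */
      void store_then_fence(uint64_t *p, uint64_t v)
      {
          *p = v;                                        /* str */
          __asm__ __volatile__("dmb ish" ::: "memory");  /* full barrier */
      }

      /* GCC/Clang compile a seq_cst store to a single stlr on AArch64. */
      void store_release(uint64_t *p, uint64_t v)
      {
          __atomic_store_n(p, v, __ATOMIC_SEQ_CST);
      }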

Alex Bennée, who mentored me last year, has agreed to mentor me again this
time if the proposal is accepted.

Please let me know if you have any comments or suggestions. Also please let me
know if there are other enhancements that are easily implementable to increase
TCG performance as part of this project or otherwise.

Thanks,
-- 
Pranith


* Re: [Qemu-devel] GSoC 2017 Proposal: TCG performance enhancements
  2017-03-25 16:52 [Qemu-devel] GSoC 2017 Proposal: TCG performance enhancements Pranith Kumar
@ 2017-03-27 10:57 ` Richard Henderson
  2017-03-27 13:22   ` Alex Bennée
                     ` (2 more replies)
  2017-03-27 11:32 ` [Qemu-devel] GSoC 2017 Proposal: TCG performance enhancements Paolo Bonzini
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 23+ messages in thread
From: Richard Henderson @ 2017-03-27 10:57 UTC (permalink / raw)
  To: Pranith Kumar, Peter Maydell, Paolo Bonzini, Emilio G. Cota
  Cc: Alex Bennée, qemu-devel

On 03/26/2017 02:52 AM, Pranith Kumar wrote:
> Hello,
>
> With MTTCG code now merged in mainline, I tried to see if we are able to run
> x86 SMP guests on ARM64 hosts. For this I tried running a windows XP guest on
> a dragonboard 410c which has 1GB RAM. Since x86 has a strong memory model
> whereas ARM64 is a weak memory model, I added a patch to generate fence
> instructions for every guest memory access. After some minor fixes, I was
> successfully able to boot a 4 core guest all the way to the desktop (albeit
> with a 1GB backing swap). However the performance is severely
> limited and the guest is barely usable. Based on my observations, I think
> there are some easily implementable additions we can make to improve the
> performance of TCG in general and on ARM64 in particular. I propose to do the
> following as part of Google Summer of Code 2017.
>
>
> * Implement jump-to-register instruction on ARM64 to overcome the 128MB
>   translation cache size limit.
>
>   The translation cache size for an ARM64 host is currently limited to 128
>   MB. This limitation is imposed by utilizing a branch instruction which
>   encodes the jump offset and is limited by the number of bits it can use for
>   the range of the offset. The performance impact by this limitation is severe
>   and can be observed when you try to run large programs like a browser in the
>   guest. The cache is flushed several times before the browser starts and the
>   performance is not satisfactory. This limitation can be overcome by
>   generating a branch-to-register instruction and utilizing that when the
>   destination address is outside the range of what can be encoded in current
>   branch instruction.

128MB is really quite large.  I doubt doubling the cache size will really help 
that much.  That said, it's really quite trivial to make this change, if you'd 
like to experiment.

FWIW, I rarely see TB flushes for alpha -- not one during an entire gcc 
bootstrap.  Now, this is usually with 4GB ram, which by default implies 512MB 
translation cache.  But it does mean that, given an ideal guest, TB flushes 
should not dominate anything at all.

If you're seeing multiple flushes during the startup of a browser, your guest 
must be flushing for other reasons than the code_gen_buffer being full.


> * Implement an LRU translation block code cache.
>
>   In the current TCG design, when the translation cache fills up, we flush all
>   the translated blocks (TBs) to free up space. We can improve this situation
>   by not flushing the TBs that were recently used i.e., by implementing an LRU
>   policy for freeing the blocks. This should avoid the re-translation overhead
>   for frequently used blocks and improve performance.

The major problem you'll encounter is how to manage allocation in this case.

The current mechanism is trivial precisely because we never need to know in 
advance how much code will be generated for a given set of TCG opcodes.  When 
we reach the high-water mark, we've run out of room.  We then flush everything 
and start over at the beginning of the buffer.

If you manage the cache with an allocator, you'll need to know in advance how 
much code is going to be generated.  This is going to require that you either 
(1) severely over-estimate the space required (qemu_ld generates lots more code 
than just add), (2) severely increase the time required, by generating code 
twice, or (3) somewhat increase the time required, by generating 
position-independent code into an external buffer and copying it into place 
after determining the size.


> * Avoid consistency overhead for strong memory model guests by generating
>   load-acquire and store-release instructions.

This is probably required for good performance of the user-only code path, but 
considering the number of other insns required for the system tlb lookup, I'm 
surprised that the memory barrier matters.

> Please let me know if you have any comments or suggestions. Also please let me
> know if there are other enhancements that are easily implementable to increase
> TCG performance as part of this project or otherwise.

I think it would be interesting to place TranslationBlock structures into the 
same memory block as code_gen_buffer, immediately before the code that 
implements the TB.

Consider what happens within every TB:

(1) We have one or more references to the TB address, via exit_tb.

For aarch64, this will normally require 2-4 insns.

# alpha-softmmu
0x7f75152114:  d0ffb320      adrp x0, #-0x99a000 (addr 0x7f747b8000)
0x7f75152118:  91004c00      add x0, x0, #0x13 (19)
0x7f7515211c:  17ffffc3      b #-0xf4 (addr 0x7f75152028)

# alpha-linux-user
0x00569500:  d2800260      mov x0, #0x13
0x00569504:  f2b59820      movk x0, #0xacc1, lsl #16
0x00569508:  f2c00fe0      movk x0, #0x7f, lsl #32
0x0056950c:  17ffffdf      b #-0x84 (addr 0x569488)

We would reduce this to one insn, always, if the TB were close by, since the 
ADR instruction has a range of 1MB.


(2) We have zero to two references to a linked TB, via goto_tb.

Your stated goal above for eliminating the code_gen_buffer maximum of 128MB can 
be done in two ways.

(2A) Raise the maximum to 2GB.  For this we would align an instruction pair, 
adrp+add, to compute the address; the following insn would branch.  The update 
code would write a new destination by modifying the adrp+add with a single
64-bit store.
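
A rough sketch of that update path (a hypothetical helper, not QEMU code;
encodings taken from the ARMv8 ARM; the adrp range check and i-cache
maintenance are omitted):

  #include <stdint.h>

  /* Re-point an 8-byte-aligned adrp+add pair (both writing register rd)
     at a new target with one atomic 64-bit store, so a concurrently
     executing cpu sees either the old pair or the new one. */
  static void patch_adrp_add(uint64_t *pair, uintptr_t target, unsigned rd)
  {
      int64_t pages = (int64_t)(target >> 12) - (int64_t)((uintptr_t)pair >> 12);
      uint32_t adrp = 0x90000000u | ((uint32_t)(pages & 3) << 29)
                                  | ((uint32_t)((pages >> 2) & 0x7ffff) << 5)
                                  | rd;
      uint32_t add  = 0x91000000u | ((uint32_t)(target & 0xfff) << 10)
                                  | (rd << 5) | rd;

      /* Little-endian host: adrp occupies the low word of the pair. */
      __atomic_store_n(pair, (uint64_t)adrp | ((uint64_t)add << 32),
                       __ATOMIC_RELEASE);
  }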

(2B) Eliminate the maximum altogether by referencing the destination directly 
in the TB.  This is the !USE_DIRECT_JUMP path.  It is normally not used on 
64-bit targets because computing the full 64-bit address of the TB is harder, 
or just as hard, as computing the full 64-bit address of the destination.

However, if the TB is nearby, aarch64 can load the address from 
TB.jmp_target_addr in one insn, with LDR (literal).  This pc-relative load also 
has a 1MB range.

This has the side benefit that it is much quicker to re-link TBs, both in the 
computation of the code for the destination as well as re-flushing the icache.


In addition, I strongly suspect the 1,342,177 entries (153MB) that we currently 
allocate for tcg_ctx.tb_ctx.tbs, given a 512MB code_gen_buffer, are excessive.

If we co-allocate the TB and the code, then we get exactly the right number of 
TBs allocated with no further effort.

There will be some additional memory wastage, since we'll want to keep the code 
and the data in different cache lines and that means padding, but I don't think 
that'll be significant.  Indeed, given the over-allocation above, this will 
probably still be a net savings.


r~


* Re: [Qemu-devel] GSoC 2017 Proposal: TCG performance enhancements
  2017-03-25 16:52 [Qemu-devel] GSoC 2017 Proposal: TCG performance enhancements Pranith Kumar
  2017-03-27 10:57 ` Richard Henderson
@ 2017-03-27 11:32 ` Paolo Bonzini
  2017-03-28  3:07   ` Pranith Kumar
  2017-03-27 15:54 ` Stefan Hajnoczi
  2017-06-06 17:13 ` Emilio G. Cota
  3 siblings, 1 reply; 23+ messages in thread
From: Paolo Bonzini @ 2017-03-27 11:32 UTC (permalink / raw)
  To: Pranith Kumar, Richard Henderson, Peter Maydell, Emilio G. Cota
  Cc: Alex Bennée, qemu-devel



On 25/03/2017 17:52, Pranith Kumar wrote:
> * Implement an LRU translation block code cache.
> 
>   In the current TCG design, when the translation cache fills up, we flush all
>   the translated blocks (TBs) to free up space. We can improve this situation
>   by not flushing the TBs that were recently used i.e., by implementing an LRU
>   policy for freeing the blocks. This should avoid the re-translation overhead
>   for frequently used blocks and improve performance.

IIRC, Emilio measured one flush every roughly 10 seconds with 128 MB
cache in system emulation mode---and "never" is a pretty accurate
estimate for user-mode emulation.  This means that a really hot block
would be retranslated very quickly.

Paolo


* Re: [Qemu-devel] GSoC 2017 Proposal: TCG performance enhancements
  2017-03-27 10:57 ` Richard Henderson
@ 2017-03-27 13:22   ` Alex Bennée
  2017-03-28  3:03   ` Pranith Kumar
  2017-06-02 23:39   ` [Qemu-devel] [PATCH] tcg: allocate TB structs before the corresponding translated code Emilio G. Cota
  2 siblings, 0 replies; 23+ messages in thread
From: Alex Bennée @ 2017-03-27 13:22 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Pranith Kumar, Peter Maydell, Paolo Bonzini, Emilio G. Cota, qemu-devel


Richard Henderson <rth@twiddle.net> writes:

> On 03/26/2017 02:52 AM, Pranith Kumar wrote:
>> Hello,
>>
<snip>
>
>> Please let me know if you have any comments or suggestions. Also please let me
>> know if there are other enhancements that are easily implementable to increase
>> TCG performance as part of this project or otherwise.
>
> I think it would be interesting to place TranslationBlock structures
> into the same memory block as code_gen_buffer, immediately before the
> code that implements the TB.
>
> Consider what happens within every TB:
>
> (1) We have one or more references to the TB address, via exit_tb.
>
> For aarch64, this will normally require 2-4 insns.
>
> # alpha-softmmu
> 0x7f75152114:  d0ffb320      adrp x0, #-0x99a000 (addr 0x7f747b8000)
> 0x7f75152118:  91004c00      add x0, x0, #0x13 (19)
> 0x7f7515211c:  17ffffc3      b #-0xf4 (addr 0x7f75152028)
>
> # alpha-linux-user
> 0x00569500:  d2800260      mov x0, #0x13
> 0x00569504:  f2b59820      movk x0, #0xacc1, lsl #16
> 0x00569508:  f2c00fe0      movk x0, #0x7f, lsl #32
> 0x0056950c:  17ffffdf      b #-0x84 (addr 0x569488)
>
> We would reduce this to one insn, always, if the TB were close by,
> since the ADR instruction has a range of 1MB.

Having the TB address statically addressable from the generated code would
also be very handy for doing things like rough block execution counts
(or even precise ones, if you are willing to pay the atomic-increment
penalty).

It would be nice for future work to be able to track where our hot paths
are through the generated code.

>
>
> (2) We have zero to two references to a linked TB, via goto_tb.
>
> Your stated goal above for eliminating the code_gen_buffer maximum of
> 128MB can be done in two ways.
>
> (2A) Raise the maximum to 2GB.  For this we would align an instruction
> pair, adrp+add, to compute the address; the following insn would
> branch.  The update code would write a new destination by modifing the
> adrp+add with a single 64-bit store.
>
> (2B) Eliminate the maximum altogether by referencing the destination
> directly in the TB.  This is the !USE_DIRECT_JUMP path.  It is
> normally not used on 64-bit targets because computing the full 64-bit
> address of the TB is harder, or just as hard, as computing the full
> 64-bit address of the destination.
>
> However, if the TB is nearby, aarch64 can load the address from
> TB.jmp_target_addr in one insn, with LDR (literal).  This pc-relative
> load also has a 1MB range.
>
> This has the side benefit that it is much quicker to re-link TBs, both
> in the computation of the code for the destination as well as
> re-flushing the icache.
>
>
> In addition, I strongly suspect the 1,342,177 entries (153MB) that we
> currently allocate for tcg_ctx.tb_ctx.tbs, given a 512MB
> code_gen_buffer, is excessive.
>
> If we co-allocate the TB and the code, then we get exactly the right
> number of TBs allocated with no further effort.
>
> There will be some additional memory wastage, since we'll want to keep
> the code and the data in different cache lines and that means padding,
> but I don't think that'll be significant.  Indeed, given the above
> over-allocation will probably still be a net savings.
>
>
> r~


--
Alex Bennée


* Re: [Qemu-devel] GSoC 2017 Proposal: TCG performance enhancements
  2017-03-25 16:52 [Qemu-devel] GSoC 2017 Proposal: TCG performance enhancements Pranith Kumar
  2017-03-27 10:57 ` Richard Henderson
  2017-03-27 11:32 ` [Qemu-devel] GSoC 2017 Proposal: TCG performance enhancements Paolo Bonzini
@ 2017-03-27 15:54 ` Stefan Hajnoczi
  2017-03-27 17:13   ` Pranith Kumar
  2017-06-06 17:13 ` Emilio G. Cota
  3 siblings, 1 reply; 23+ messages in thread
From: Stefan Hajnoczi @ 2017-03-27 15:54 UTC (permalink / raw)
  To: Pranith Kumar
  Cc: Richard Henderson, Peter Maydell, Paolo Bonzini, Emilio G. Cota,
	Alex Bennée, qemu-devel

On Sat, Mar 25, 2017 at 12:52:35PM -0400, Pranith Kumar wrote:
> Alex Bennée, who mentored me last year, has agreed to mentor me again this
> time if the proposal is accepted.

Thanks, the project idea looks good for GSoC.  I've talked to Alex about
adding it to the wiki page.

The "How to propose a custom project idea" section on the wiki says:

  Note that other candidates can apply for newly added project ideas.
  This ensures that custom project ideas are fair and open.

This means that Alex has agreed to mentor the _project idea_.  Proposing
a custom project idea doesn't guarantee that you will be selected for
it.

I think you already knew that, but I wanted to clarify in case someone
reading this misinterprets what you wrote and thinks custom project ideas
are a loophole for getting into GSoC.

Stefan


* Re: [Qemu-devel] GSoC 2017 Proposal: TCG performance enhancements
  2017-03-27 15:54 ` Stefan Hajnoczi
@ 2017-03-27 17:13   ` Pranith Kumar
  0 siblings, 0 replies; 23+ messages in thread
From: Pranith Kumar @ 2017-03-27 17:13 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: Richard Henderson, Peter Maydell, Paolo Bonzini, Emilio G. Cota,
	Alex Bennée, qemu-devel

Hi Stefan,

On Mon, Mar 27, 2017 at 11:54 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
> On Sat, Mar 25, 2017 at 12:52:35PM -0400, Pranith Kumar wrote:
>> Alex Bennée, who mentored me last year, has agreed to mentor me again this
>> time if the proposal is accepted.
>
> Thanks, the project idea looks good for GSoC.  I've talked to Alex about
> adding it to the wiki page.
>
> The "How to propose a custom project idea" section on the wiki says:
>
>   Note that other candidates can apply for newly added project ideas.
>   This ensures that custom project ideas are fair and open.
>
> This means that Alex has agreed to mentor the _project idea_.  Proposing
> a custom project idea doesn't guarantee that you will be selected for
> it.
>
> I think you already knew that but I wanted to clarify in case someone
> reading misinterprets what you wrote to think custom project ideas are a
> loophole for getting into GSoC.

Yes, I was waiting for the project idea to be finalized before mailing
you with the filled out template. But if you think it will be easier
if I add it first and then edit it, I will send you the template. I
will update the wiki as the discussion progresses.

-- 
Pranith


* Re: [Qemu-devel] GSoC 2017 Proposal: TCG performance enhancements
  2017-03-27 10:57 ` Richard Henderson
  2017-03-27 13:22   ` Alex Bennée
@ 2017-03-28  3:03   ` Pranith Kumar
  2017-03-28  3:09     ` Pranith Kumar
  2017-06-02 23:39   ` [Qemu-devel] [PATCH] tcg: allocate TB structs before the corresponding translated code Emilio G. Cota
  2 siblings, 1 reply; 23+ messages in thread
From: Pranith Kumar @ 2017-03-28  3:03 UTC (permalink / raw)
  To: Richard Henderson
  Cc: Peter Maydell, Paolo Bonzini, Emilio G. Cota, Alex Bennée,
	qemu-devel

Hi Richard,

Thanks for the feedback. Please find some comments inline.

On Mon, Mar 27, 2017 at 6:57 AM, Richard Henderson <rth@twiddle.net> wrote:
>
> 128MB is really quite large.  I doubt doubling the cache size will really
> help that much.  That said, it's really quite trivial to make this change,
> if you'd like to experiment.
>
> FWIW, I rarely see TB flushes for alpha -- not one during an entire gcc
> bootstrap.  Now, this is usually with 4GB ram, which by default implies
> 512MB translation cache.  But it does mean that, given an ideal guest, TB
> flushes should not dominate anything at all.
>
> If you're seeing multiple flushes during the startup of a browser, your
> guest must be flushing for other reasons than the code_gen_buffer being
> full.
>

This is indeed the case. From commit a9353fe897ca onwards, we are
flushing the tb cache instead of invalidating a single TB from
breakpoint_invalidate(). Now that MTTCG added proper tb/mmap locking,
we can revert that commit. I will do so once the merge window opens.

>
>> * Implement an LRU translation block code cache.
>
>
> The major problem you'll encounter is how to manage allocation in this case.
>
> The current mechanism means that it is trivial to not know how much code is
> going to be generated for a given set of TCG opcodes.  When we reach the
> high-water mark, we've run out of room.  We then flush everything and start
> over at the beginning of the buffer.
>
> If you manage the cache with an allocator, you'll need to know in advance
> how much code is going to be generated.  This is going to require that you
> either (1) severely over-estimate the space required (qemu_ld generates lots
> more code than just add), (2) severely increase the time required, by
> generating code twice, or (3) somewhat increase the time required, by
> generating position-independent code into an external buffer and copying it
> into place after determining the size.
>

Option (3) seems to be the only feasible one, but I am not sure how easy
it is to generate position-independent code. Do you think it can be done
as a GSoC project?

>
>> * Avoid consistency overhead for strong memory model guests by generating
>>   load-acquire and store-release instructions.
>
>
> This is probably required for good performance of the user-only code path,
> but considering the number of other insns required for the system tlb
> lookup, I'm surprised that the memory barrier matters.
>

I know that having some experimental data would help to show the benefit
accurately, but my observation from generating store-release instructions
instead of store+fence is that it helps make the system more usable. I
will try to collect this data for a Linux x86 guest.

>
> I think it would be interesting to place TranslationBlock structures into
> the same memory block as code_gen_buffer, immediately before the code that
> implements the TB.
>
> Consider what happens within every TB:
>
> (1) We have one or more references to the TB address, via exit_tb.
>
> For aarch64, this will normally require 2-4 insns.
>
> # alpha-softmmu
> 0x7f75152114:  d0ffb320      adrp x0, #-0x99a000 (addr 0x7f747b8000)
> 0x7f75152118:  91004c00      add x0, x0, #0x13 (19)
> 0x7f7515211c:  17ffffc3      b #-0xf4 (addr 0x7f75152028)
>
> # alpha-linux-user
> 0x00569500:  d2800260      mov x0, #0x13
> 0x00569504:  f2b59820      movk x0, #0xacc1, lsl #16
> 0x00569508:  f2c00fe0      movk x0, #0x7f, lsl #32
> 0x0056950c:  17ffffdf      b #-0x84 (addr 0x569488)
>
> We would reduce this to one insn, always, if the TB were close by, since the
> ADR instruction has a range of 1MB.
>
> (2) We have zero to two references to a linked TB, via goto_tb.
>
> Your stated goal above for eliminating the code_gen_buffer maximum of 128MB
> can be done in two ways.
>
> (2A) Raise the maximum to 2GB.  For this we would align an instruction pair,
> adrp+add, to compute the address; the following insn would branch.  The
> update code would write a new destination by modifing the adrp+add with a
> single 64-bit store.
>
> (2B) Eliminate the maximum altogether by referencing the destination
> directly in the TB.  This is the !USE_DIRECT_JUMP path.  It is normally not
> used on 64-bit targets because computing the full 64-bit address of the TB
> is harder, or just as hard, as computing the full 64-bit address of the
> destination.
>
> However, if the TB is nearby, aarch64 can load the address from
> TB.jmp_target_addr in one insn, with LDR (literal).  This pc-relative load
> also has a 1MB range.
>
> This has the side benefit that it is much quicker to re-link TBs, both in
> the computation of the code for the destination as well as re-flushing the
> icache.

This (2B) is the idea I had in mind, ideally as a combination of both of
the above: if the destination falls outside the short range, we take the
penalty and generate the full 64-bit address. A rough sketch follows below.
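
A rough sketch of that combination (hypothetical emit helpers, not the
actual tcg/aarch64 backend API; x16 is assumed to be a free scratch
register):

  #include <stdint.h>

  static void emit32(uint32_t **code_ptr, uint32_t insn)
  {
      *(*code_ptr)++ = insn;
  }

  /* Emit a jump to 'target': a single direct B when it is within +/-128MB
     of the current code pointer, otherwise materialise the full 64-bit
     address in x16 and branch through the register. */
  static void emit_jump(uint32_t **code_ptr, uintptr_t target)
  {
      intptr_t offset = (intptr_t)target - (intptr_t)*code_ptr;

      if (offset >= -(128 << 20) && offset < (128 << 20) && !(offset & 3)) {
          /* b <imm26> */
          emit32(code_ptr, 0x14000000u | (((uint64_t)offset >> 2) & 0x03ffffffu));
      } else {
          /* movz/movk x16 with the 64-bit target, then br x16 */
          emit32(code_ptr, 0xd2800010u | ((target & 0xffffu) << 5));
          emit32(code_ptr, 0xf2a00010u | (((target >> 16) & 0xffffu) << 5));
          emit32(code_ptr, 0xf2c00010u | (((target >> 32) & 0xffffu) << 5));
          emit32(code_ptr, 0xf2e00010u | (((target >> 48) & 0xffffu) << 5));
          emit32(code_ptr, 0xd61f0200u);  /* br x16 */
      }
  }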

>
>
> In addition, I strongly suspect the 1,342,177 entries (153MB) that we
> currently allocate for tcg_ctx.tb_ctx.tbs, given a 512MB code_gen_buffer, is
> excessive.
>
> If we co-allocate the TB and the code, then we get exactly the right number
> of TBs allocated with no further effort.
>
> There will be some additional memory wastage, since we'll want to keep the
> code and the data in different cache lines and that means padding, but I
> don't think that'll be significant.  Indeed, given the above over-allocation
> will probably still be a net savings.
>

If you think the project makes sense, I will add it to the GSoC wiki
so that others can also apply for it. Please let me know if you are
interested in mentoring it along with Alex.

Thanks,
-- 
Pranith


* Re: [Qemu-devel] GSoC 2017 Proposal: TCG performance enhancements
  2017-03-27 11:32 ` [Qemu-devel] GSoC 2017 Proposal: TCG performance enhancements Paolo Bonzini
@ 2017-03-28  3:07   ` Pranith Kumar
  0 siblings, 0 replies; 23+ messages in thread
From: Pranith Kumar @ 2017-03-28  3:07 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Richard Henderson, Peter Maydell, Emilio G. Cota,
	Alex Bennée, qemu-devel

Hi Paolo,

On Mon, Mar 27, 2017 at 7:32 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 25/03/2017 17:52, Pranith Kumar wrote:
>> * Implement an LRU translation block code cache.
>>
>>   In the current TCG design, when the translation cache fills up, we flush all
>>   the translated blocks (TBs) to free up space. We can improve this situation
>>   by not flushing the TBs that were recently used i.e., by implementing an LRU
>>   policy for freeing the blocks. This should avoid the re-translation overhead
>>   for frequently used blocks and improve performance.
>
> IIRC, Emilio measured one flush every roughly 10 seconds with 128 MB
> cache in system emulation mode---and "never" is a pretty accurate
> estimate for user-mode emulation.  This means that a really hot block
> would be retranslated very quickly.
>

OK. The problem with re-translation is that it is a serializing step in
the current design: all the cores have to wait for the translation to
complete. I think it would be a win if we could avoid it, although I
should admit that I am not sure how large that benefit would be.

-- 
Pranith


* Re: [Qemu-devel] GSoC 2017 Proposal: TCG performance enhancements
  2017-03-28  3:03   ` Pranith Kumar
@ 2017-03-28  3:09     ` Pranith Kumar
  2017-03-28 10:03       ` Stefan Hajnoczi
  0 siblings, 1 reply; 23+ messages in thread
From: Pranith Kumar @ 2017-03-28  3:09 UTC (permalink / raw)
  To: Richard Henderson, Stefan Hajnoczi, Alex Bennée
  Cc: Peter Maydell, Paolo Bonzini, Emilio G. Cota, qemu-devel

On Mon, Mar 27, 2017 at 11:03 PM, Pranith Kumar <bobby.prani@gmail.com> wrote:

>
> If you think the project makes sense, I will add it to the GSoC wiki
> so that others can also apply for it. Please let me know if you are
> interested in mentoring it along with Alex.
>

One other thing: if you think the scope is too vast, could we split this
into multiple GSoC projects? In that case, having more mentors should
help.

-- 
Pranith


* Re: [Qemu-devel] GSoC 2017 Proposal: TCG performance enhancements
  2017-03-28  3:09     ` Pranith Kumar
@ 2017-03-28 10:03       ` Stefan Hajnoczi
  0 siblings, 0 replies; 23+ messages in thread
From: Stefan Hajnoczi @ 2017-03-28 10:03 UTC (permalink / raw)
  To: Pranith Kumar
  Cc: Richard Henderson, Alex Bennée, Peter Maydell,
	Paolo Bonzini, Emilio G. Cota, qemu-devel

On Mon, Mar 27, 2017 at 11:09:23PM -0400, Pranith Kumar wrote:
> On Mon, Mar 27, 2017 at 11:03 PM, Pranith Kumar <bobby.prani@gmail.com> wrote:
> 
> >
> > If you think the project makes sense, I will add it to the GSoC wiki
> > so that others can also apply for it. Please let me know if you are
> > interested in mentoring it along with Alex.
> >
> 
> One other thing is if you think the scope is too vast, can we split
> this and have multiple GSoC projects? In that case, having more
> mentors should help.

It's up to the mentor(s) if they want to take on more students in this
area.  Regarding your own project plan:

It's fine to have stretch goals that will be completed if time permits.
The project plan can be adjusted so don't worry about being ambitious -
it won't be held against you if you've agreed with your mentor on
certain goals that may not fit.

Stefan


* [Qemu-devel] [PATCH] tcg: allocate TB structs before the corresponding translated code
  2017-03-27 10:57 ` Richard Henderson
  2017-03-27 13:22   ` Alex Bennée
  2017-03-28  3:03   ` Pranith Kumar
@ 2017-06-02 23:39   ` Emilio G. Cota
  2017-06-04 17:47     ` Richard Henderson
  2 siblings, 1 reply; 23+ messages in thread
From: Emilio G. Cota @ 2017-06-02 23:39 UTC (permalink / raw)
  To: qemu-devel
  Cc: Richard Henderson, alex.bennee, Peter Maydell, Paolo Bonzini,
	Pranith Kumar

Allocating an arbitrarily-sized array of tbs results in either
(a) a lot of memory wasted or (b) unnecessary flushes of the code
cache when we run out of TB structs in the array.

An obvious solution would be to just malloc a TB struct when needed,
and keep the TB array as an array of pointers (recall that tb_find_pc()
needs the TB array to run in O(log n)).

Perhaps a better solution, which is implemented in this patch, is to
allocate TBs right before the translated code they describe. This
results in some memory waste due to padding to have code and TBs in
separate cache lines--for instance, I measured 4.7% of padding in the
used portion of code_gen_buffer when booting Linux on aarch64.
However, it can allow for optimizations in some host architectures,
since TCG backends could safely assume that the TB and the corresponding
translated code are very close to each other in memory. See this message
by rth for a detailed explanation:

  https://lists.gnu.org/archive/html/qemu-devel/2017-03/msg05172.html
  Subject: Re: GSoC 2017 Proposal: TCG performance enhancements
  Message-ID: <1e67644b-4b30-887e-d329-1848e94c9484@twiddle.net>

Note: this patch applies on top of rth's tcg-next branch (a34a15462).

Suggested-by: Richard Henderson <rth@twiddle.net>
Signed-off-by: Emilio G. Cota <cota@braap.org>
---
 include/exec/exec-all.h   |  2 +-
 include/exec/tb-context.h |  3 ++-
 tcg/tcg.c                 | 16 ++++++++++++++++
 tcg/tcg.h                 |  2 +-
 translate-all.c           | 37 ++++++++++++++++++++++---------------
 5 files changed, 42 insertions(+), 18 deletions(-)

diff --git a/include/exec/exec-all.h b/include/exec/exec-all.h
index 87ae10b..e431548 100644
--- a/include/exec/exec-all.h
+++ b/include/exec/exec-all.h
@@ -363,7 +363,7 @@ struct TranslationBlock {
      */
     uintptr_t jmp_list_next[2];
     uintptr_t jmp_list_first;
-};
+} QEMU_ALIGNED(64);
 
 void tb_free(TranslationBlock *tb);
 void tb_flush(CPUState *cpu);
diff --git a/include/exec/tb-context.h b/include/exec/tb-context.h
index c7f17f2..25c2afe 100644
--- a/include/exec/tb-context.h
+++ b/include/exec/tb-context.h
@@ -31,8 +31,9 @@ typedef struct TBContext TBContext;
 
 struct TBContext {
 
-    TranslationBlock *tbs;
+    TranslationBlock **tbs;
     struct qht htable;
+    size_t tbs_size;
     int nb_tbs;
     /* any access to the tbs or the page table must use this lock */
     QemuMutex tb_lock;
diff --git a/tcg/tcg.c b/tcg/tcg.c
index cb898f1..1e9da5b 100644
--- a/tcg/tcg.c
+++ b/tcg/tcg.c
@@ -383,6 +383,22 @@ void tcg_context_init(TCGContext *s)
     }
 }
 
+/*
+ * Allocate TBs right before their corresponding translated code, making
+ * sure that TBs and code are on different cache lines.
+ */
+TranslationBlock *tcg_tb_alloc(TCGContext *s)
+{
+    void *aligned;
+
+    aligned = (void *)ROUND_UP((uintptr_t)s->code_gen_ptr, 64);
+    if (unlikely(aligned + sizeof(TranslationBlock) > s->code_gen_highwater)) {
+        return NULL;
+    }
+    s->code_gen_ptr += aligned - s->code_gen_ptr + sizeof(TranslationBlock);
+    return aligned;
+}
+
 void tcg_prologue_init(TCGContext *s)
 {
     size_t prologue_size, total_size;
diff --git a/tcg/tcg.h b/tcg/tcg.h
index 5ec48d1..9e37722 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -697,7 +697,6 @@ struct TCGContext {
        here, because there's too much arithmetic throughout that relies
        on addition and subtraction working on bytes.  Rely on the GCC
        extension that allows arithmetic on void*.  */
-    int code_gen_max_blocks;
     void *code_gen_prologue;
     void *code_gen_epilogue;
     void *code_gen_buffer;
@@ -756,6 +755,7 @@ static inline bool tcg_op_buf_full(void)
 /* tb_lock must be held for tcg_malloc_internal. */
 void *tcg_malloc_internal(TCGContext *s, int size);
 void tcg_pool_reset(TCGContext *s);
+TranslationBlock *tcg_tb_alloc(TCGContext *s);
 
 void tb_lock(void);
 void tb_unlock(void);
diff --git a/translate-all.c b/translate-all.c
index b3ee876..e36c4d0 100644
--- a/translate-all.c
+++ b/translate-all.c
@@ -781,12 +781,13 @@ static inline void code_gen_alloc(size_t tb_size)
         exit(1);
     }
 
-    /* Estimate a good size for the number of TBs we can support.  We
-       still haven't deducted the prologue from the buffer size here,
-       but that's minimal and won't affect the estimate much.  */
-    tcg_ctx.code_gen_max_blocks
-        = tcg_ctx.code_gen_buffer_size / CODE_GEN_AVG_BLOCK_SIZE;
-    tcg_ctx.tb_ctx.tbs = g_new(TranslationBlock, tcg_ctx.code_gen_max_blocks);
+    /* size this conservatively -- realloc later if needed */
+    tcg_ctx.tb_ctx.tbs_size =
+        tcg_ctx.code_gen_buffer_size / CODE_GEN_AVG_BLOCK_SIZE / 8;
+    if (unlikely(!tcg_ctx.tb_ctx.tbs_size)) {
+        tcg_ctx.tb_ctx.tbs_size = 1024;
+    }
+    tcg_ctx.tb_ctx.tbs = g_new(TranslationBlock *, tcg_ctx.tb_ctx.tbs_size);
 
     qemu_mutex_init(&tcg_ctx.tb_ctx.tb_lock);
 }
@@ -828,13 +829,20 @@ bool tcg_enabled(void)
 static TranslationBlock *tb_alloc(target_ulong pc)
 {
     TranslationBlock *tb;
+    TBContext *ctx;
 
     assert_tb_locked();
 
-    if (tcg_ctx.tb_ctx.nb_tbs >= tcg_ctx.code_gen_max_blocks) {
+    tb = tcg_tb_alloc(&tcg_ctx);
+    if (unlikely(tb == NULL)) {
         return NULL;
     }
-    tb = &tcg_ctx.tb_ctx.tbs[tcg_ctx.tb_ctx.nb_tbs++];
+    ctx = &tcg_ctx.tb_ctx;
+    if (unlikely(ctx->nb_tbs == ctx->tbs_size)) {
+        ctx->tbs_size *= 2;
+        ctx->tbs = g_renew(TranslationBlock *, ctx->tbs, ctx->tbs_size);
+    }
+    ctx->tbs[ctx->nb_tbs++] = tb;
     tb->pc = pc;
     tb->cflags = 0;
     tb->invalid = false;
@@ -850,8 +858,8 @@ void tb_free(TranslationBlock *tb)
        Ignore the hard cases and just back up if this TB happens to
        be the last one generated.  */
     if (tcg_ctx.tb_ctx.nb_tbs > 0 &&
-            tb == &tcg_ctx.tb_ctx.tbs[tcg_ctx.tb_ctx.nb_tbs - 1]) {
-        tcg_ctx.code_gen_ptr = tb->tc_ptr;
+            tb == tcg_ctx.tb_ctx.tbs[tcg_ctx.tb_ctx.nb_tbs - 1]) {
+        tcg_ctx.code_gen_ptr = tb->tc_ptr - sizeof(TranslationBlock);
         tcg_ctx.tb_ctx.nb_tbs--;
     }
 }
@@ -1666,7 +1674,7 @@ static TranslationBlock *tb_find_pc(uintptr_t tc_ptr)
     m_max = tcg_ctx.tb_ctx.nb_tbs - 1;
     while (m_min <= m_max) {
         m = (m_min + m_max) >> 1;
-        tb = &tcg_ctx.tb_ctx.tbs[m];
+        tb = tcg_ctx.tb_ctx.tbs[m];
         v = (uintptr_t)tb->tc_ptr;
         if (v == tc_ptr) {
             return tb;
@@ -1676,7 +1684,7 @@ static TranslationBlock *tb_find_pc(uintptr_t tc_ptr)
             m_min = m + 1;
         }
     }
-    return &tcg_ctx.tb_ctx.tbs[m_max];
+    return tcg_ctx.tb_ctx.tbs[m_max];
 }
 
 #if !defined(CONFIG_USER_ONLY)
@@ -1874,7 +1882,7 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
     direct_jmp_count = 0;
     direct_jmp2_count = 0;
     for (i = 0; i < tcg_ctx.tb_ctx.nb_tbs; i++) {
-        tb = &tcg_ctx.tb_ctx.tbs[i];
+        tb = tcg_ctx.tb_ctx.tbs[i];
         target_code_size += tb->size;
         if (tb->size > max_target_code_size) {
             max_target_code_size = tb->size;
@@ -1894,8 +1902,7 @@ void dump_exec_info(FILE *f, fprintf_function cpu_fprintf)
     cpu_fprintf(f, "gen code size       %td/%zd\n",
                 tcg_ctx.code_gen_ptr - tcg_ctx.code_gen_buffer,
                 tcg_ctx.code_gen_highwater - tcg_ctx.code_gen_buffer);
-    cpu_fprintf(f, "TB count            %d/%d\n",
-            tcg_ctx.tb_ctx.nb_tbs, tcg_ctx.code_gen_max_blocks);
+    cpu_fprintf(f, "TB count            %d\n", tcg_ctx.tb_ctx.nb_tbs);
     cpu_fprintf(f, "TB avg target size  %d max=%d bytes\n",
             tcg_ctx.tb_ctx.nb_tbs ? target_code_size /
                     tcg_ctx.tb_ctx.nb_tbs : 0,
-- 
2.7.4


* Re: [Qemu-devel] [PATCH] tcg: allocate TB structs before the corresponding translated code
  2017-06-02 23:39   ` [Qemu-devel] [PATCH] tcg: allocate TB structs before the corresponding translated code Emilio G. Cota
@ 2017-06-04 17:47     ` Richard Henderson
  0 siblings, 0 replies; 23+ messages in thread
From: Richard Henderson @ 2017-06-04 17:47 UTC (permalink / raw)
  To: Emilio G. Cota, qemu-devel
  Cc: Peter Maydell, Pranith Kumar, Paolo Bonzini, alex.bennee

On 06/02/2017 04:39 PM, Emilio G. Cota wrote:
> +    aligned = (void *)ROUND_UP((uintptr_t)s->code_gen_ptr, 64);

I would prefer that this and

> +} QEMU_ALIGNED(64);

this both use a define.  We may well have to adjust this for different hosts. 
In particular I'm thinking of PPC64 which would prefer 128.

> +    if (unlikely(!tcg_ctx.tb_ctx.tbs_size)) {
> +        tcg_ctx.tb_ctx.tbs_size = 1024;
> +    }

And I know that you resize this on demand, but surely we can avoid some startup 
slowdown by picking a more reasonable initial estimate here.  Like 32k or 64k.

Otherwise this looks good.  I'll have to have a more detailed look at the 
differences in the generated code later.


r~


* Re: [Qemu-devel] GSoC 2017 Proposal: TCG performance enhancements
  2017-03-25 16:52 [Qemu-devel] GSoC 2017 Proposal: TCG performance enhancements Pranith Kumar
                   ` (2 preceding siblings ...)
  2017-03-27 15:54 ` Stefan Hajnoczi
@ 2017-06-06 17:13 ` Emilio G. Cota
  2017-06-07 10:15   ` Alex Bennée
  2017-06-07 11:12   ` Lluís Vilanova
  3 siblings, 2 replies; 23+ messages in thread
From: Emilio G. Cota @ 2017-06-06 17:13 UTC (permalink / raw)
  To: Pranith Kumar
  Cc: Richard Henderson, Peter Maydell, Paolo Bonzini,
	Alex Bennée, qemu-devel

On Sat, Mar 25, 2017 at 12:52:35 -0400, Pranith Kumar wrote:
(snip)
> * Implement an LRU translation block code cache.
> 
>   In the current TCG design, when the translation cache fills up, we flush all
>   the translated blocks (TBs) to free up space. We can improve this situation
>   by not flushing the TBs that were recently used i.e., by implementing an LRU
>   policy for freeing the blocks. This should avoid the re-translation overhead
>   for frequently used blocks and improve performance.

I doubt this will yield any benefits because:

- I still have not found a workload where the performance bottleneck is
  code retranslation due to unnecessary flushes (unless of course we
  artificially restrict the size of code_gen_buffer).
- To keep track of LRU you need at least one extra instruction on every
  TB, e.g. to increment a counter or add a timestamp. This might be expensive
  and possibly a scalability bottleneck (e.g. what to do when several
  cores are executing the same TB?).
- tb_find_pc now does a simple binary search. This is easy because we
  know that TBs are allocated from code_gen_buffer in order. If they
  were out of order, we'd need another data structure (e.g. some sort of
  tree) for quick searches. This is not a fast path, though, so this
  could be OK.

(snip)
> Please let me know if you have any comments or suggestions. Also please let me
> know if there are other enhancements that are easily implementable to increase
> TCG performance as part of this project or otherwise.

My not-necessarily-easy-to-implement wishlist would be:

- Reduction of tb_lock contention when booting many cores. For instance,
  booting 64 aarch64 cores on a 64-core host shows quite a bit of contention (host
  cores are 80% idle, i.e. waiting to acquire tb_lock); fortunately this is not a
  big deal (e.g. 4s for booting 1 core vs. ~14s to boot 64) and anyway most
  long-running workloads are cached a lot more effectively.
  Still, it would make sense to consider the option of not going through tb_lock
  etc. (via a private cache? or simply not caching at all) for code that is not
  executed many times. Another option is to translate privately, and only acquire
  tb_lock to copy the translated code to the shared buffer.

- Instrumentation. I think QEMU should have a good interface to enable
  dynamic binary instrumentation. This has many uses and in fact there
  are quite a few forks of QEMU doing this.
  I think Lluís Vilanova's work [1] is a good start to eventually get
  something upstream.

		Emilio

[1] https://projects.gso.ac.upc.edu/projects/qemu-dbi


* Re: [Qemu-devel] GSoC 2017 Proposal: TCG performance enhancements
  2017-06-06 17:13 ` Emilio G. Cota
@ 2017-06-07 10:15   ` Alex Bennée
  2017-06-07 11:12   ` Lluís Vilanova
  1 sibling, 0 replies; 23+ messages in thread
From: Alex Bennée @ 2017-06-07 10:15 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: Pranith Kumar, Richard Henderson, Peter Maydell, Paolo Bonzini,
	qemu-devel


Emilio G. Cota <cota@braap.org> writes:

> On Sat, Mar 25, 2017 at 12:52:35 -0400, Pranith Kumar wrote:
> (snip)
>> * Implement an LRU translation block code cache.
>>
>>   In the current TCG design, when the translation cache fills up, we flush all
>>   the translated blocks (TBs) to free up space. We can improve this situation
>>   by not flushing the TBs that were recently used i.e., by implementing an LRU
>>   policy for freeing the blocks. This should avoid the re-translation overhead
>>   for frequently used blocks and improve performance.
>
> I doubt this will yield any benefits because:
>
> - I still have not found a workload where the performance bottleneck is
>   code retranslation due to unnecessary flushes (unless of course we
>   artificially restrict the size of code_gen_buffer.)
> - To keep track of LRU you need at least one extra instruction on every
>   TB, e.g. to increase a counter or add a timestamp. This might be expensive
>   and possibly a scalability bottleneck (e.g. what to do when several
>   cores are executing the same TB?).
> - tb_find_pc now does a simple binary search. This is easy because we
>   know that TB's are allocated from code_gen_buffer in order. If they
>   were out of order, we'd need another data structure (e.g. some sort of
>   tree) to have quick searches. This is not a fast path though so this
>   could be OK.

Certainly to make changes here we would need some proper numbers showing
it is a problem. Even my re-compile stress-ng test only flushes every
now and then.

>
> (snip)
>> Please let me know if you have any comments or suggestions. Also please let me
>> know if there are other enhancements that are easily implementable to increase
>> TCG performance as part of this project or otherwise.
>
> My not-necessarily-easy-to-implement wishlist would be:
>
> - Reduction of tb_lock contention when booting many cores. For instance,
>   booting 64 aarch64 cores on a 64-core host shows quite a bit of contention (host
>   cores are 80% idle, i.e. waiting to acquire tb_lock); fortunately this is not a
>   big deal (e.g. 4s for booting 1 core vs. ~14s to boot 64) and anyway most
>   long-running workloads are cached a lot more effectively.
>   Still, it would make sense to consider the option of not going through tb_lock
>   etc. (via a private cache? or simply not caching at all) for code that is not
>   executed many times. Another option is to translate privately, and only acquire
>   tb_lock to copy the translated code to the shared buffer.

Currently tb_lock protects the whole translation cycle. However to get
any sort of parallelism in a different translation cache we would also
need to make the translators thread safe. Currently translation involves
too many shared globals across the core TCG state as well as the
per-arch translate.c functions.

>
> - Instrumentation. I think QEMU should have a good interface to enable
>   dynamic binary instrumentation. This has many uses and in fact there
>   are quite a few forks of QEMU doing this.
>   I think Lluís Vilanova's work [1] is a good start to eventually get
>   something upstream.

I too want to see more here. It would be nice to have a hit count for
each block and some live introspection so we could investigate the
hottest blocks and examine the code they generate more closely.

I think there is scope for a big improvement if you could create a
hot-path series of basic blocks with multiple exit points and avoid the
spill/fills of registers in the hot path. However this is a fairly major
change to the current design.

Outside of performance improvements having a good instrumentation story
would be good for people who want to do analysis of guest behaviour.

>
> 		Emilio
>
> [1] https://projects.gso.ac.upc.edu/projects/qemu-dbi


--
Alex Bennée


* Re: [Qemu-devel] GSoC 2017 Proposal: TCG performance enhancements
  2017-06-06 17:13 ` Emilio G. Cota
  2017-06-07 10:15   ` Alex Bennée
@ 2017-06-07 11:12   ` Lluís Vilanova
  2017-06-07 12:07     ` Peter Maydell
  1 sibling, 1 reply; 23+ messages in thread
From: Lluís Vilanova @ 2017-06-07 11:12 UTC (permalink / raw)
  To: Emilio G. Cota
  Cc: Pranith Kumar, Peter Maydell, Paolo Bonzini, Alex Bennée,
	qemu-devel, Richard Henderson

Emilio G Cota writes:

> - Instrumentation. I think QEMU should have a good interface to enable
>   dynamic binary instrumentation. This has many uses and in fact there
>   are quite a few forks of QEMU doing this.
>   I think Lluís Vilanova's work [1] is a good start to eventually get
>   something upstream.

> [1] https://projects.gso.ac.upc.edu/projects/qemu-dbi

Hey, I'm really happy you think that's worth pursuing. Even if it doesn't look
like it, I keep working on this on small bits of free time. I have a few patch
series that were ready to send, but should now be rebased to upstream before
that. In fact, I have an academic paper on the back-burner describing the work I
did (there's some cool tricks), but was waiting to get the core
intrumentation-agnostic infrastructure upstreamed first.

My understanding was that adding a public instrumentation interface would add
too much code maintenance overhead for a feature that is not in QEMU's core
target.

Over time, I've kept simplifying large parts of the instrumentation code base,
and maybe things have changed in QEMU enough to rethink whether it's worth
integrating. Of course, I'm completely open to discussing it.


Cheers,
  Lluis


* Re: [Qemu-devel] GSoC 2017 Proposal: TCG performance enhancements
  2017-06-07 11:12   ` Lluís Vilanova
@ 2017-06-07 12:07     ` Peter Maydell
  2017-06-07 13:35       ` Paolo Bonzini
  2017-06-07 15:45       ` Lluís Vilanova
  0 siblings, 2 replies; 23+ messages in thread
From: Peter Maydell @ 2017-06-07 12:07 UTC (permalink / raw)
  To: Emilio G. Cota, Pranith Kumar, Peter Maydell, Paolo Bonzini,
	Alex Bennée, qemu-devel, Richard Henderson

On 7 June 2017 at 12:12, Lluís Vilanova <vilanova@ac.upc.edu> wrote:
> My understanding was that adding a public instrumentation interface would add
> too much code maintenance overhead for a feature that is not in QEMU's core
> target.

Well, it depends what you define as our core target :-)
I think we get quite a lot of users that want some useful ability
to see what their guest code is doing, and these days (when
dev board hardware is often very cheap and easily available)
I think that's a lot of the value that emulation can bring to
the table. Obviously we would want to try to do it in a way
that is low-runtime-overhead and is easy to get right for
people adding/maintaining cpu target frontend code...

thanks
-- PMM


* Re: [Qemu-devel] GSoC 2017 Proposal: TCG performance enhancements
  2017-06-07 12:07     ` Peter Maydell
@ 2017-06-07 13:35       ` Paolo Bonzini
  2017-06-07 15:52         ` Lluís Vilanova
  2017-06-07 15:45       ` Lluís Vilanova
  1 sibling, 1 reply; 23+ messages in thread
From: Paolo Bonzini @ 2017-06-07 13:35 UTC (permalink / raw)
  To: Peter Maydell, Emilio G. Cota, Pranith Kumar, Alex Bennée,
	qemu-devel, Richard Henderson, Alessandro Di Federico



On 07/06/2017 14:07, Peter Maydell wrote:
>> My understanding was that adding a public instrumentation interface would add
>> too much code maintenance overhead for a feature that is not in QEMU's core
>> target.
> Well, it depends what you define as our core target :-)
> I think we get quite a lot of users that want some useful ability
> to see what their guest code is doing, and these days (when
> dev board hardware is often very cheap and easily available)

and virtualization is too...

> I think that's a lot of the value that emulation can bring to
> the table. Obviously we would want to try to do it in a way
> that is low-runtime-overhead and is easy to get right for
> people adding/maintaining cpu target frontend code...

Indeed.  I even sometimes use TCG -d in_asm,exec,int for KVM unit tests,
because it's easier to debug them that way :) so introspection ability
is welcome.

Related to this is also Alessandro's work to librarify TCG (he has a
TCG->LLVM backend for example).


* Re: [Qemu-devel] GSoC 2017 Proposal: TCG performance enhancements
  2017-06-07 12:07     ` Peter Maydell
  2017-06-07 13:35       ` Paolo Bonzini
@ 2017-06-07 15:45       ` Lluís Vilanova
  2017-06-07 16:17         ` Peter Maydell
  2017-06-07 22:49         ` Emilio G. Cota
  1 sibling, 2 replies; 23+ messages in thread
From: Lluís Vilanova @ 2017-06-07 15:45 UTC (permalink / raw)
  To: Peter Maydell
  Cc: Emilio G. Cota, Pranith Kumar, Paolo Bonzini, Alex Bennée,
	qemu-devel, Richard Henderson

Peter Maydell writes:

> On 7 June 2017 at 12:12, Lluís Vilanova <vilanova@ac.upc.edu> wrote:
>> My understanding was that adding a public instrumentation interface would add
>> too much code maintenance overhead for a feature that is not in QEMU's core
>> target.

> Well, it depends what you define as our core target :-)
> I think we get quite a lot of users that want some useful ability
> to see what their guest code is doing, and these days (when
> dev board hardware is often very cheap and easily available)
> I think that's a lot of the value that emulation can bring to
> the table. Obviously we would want to try to do it in a way
> that is low-runtime-overhead and is easy to get right for
> people adding/maintaining cpu target frontend code...

In that case I would say that QEMU is now much more in line with what I
proposed. The mechanisms I have (and most have been sent here in the form of
patch series) are architecture-agnostic (the generic code translation loop I
RFC'ed some time ago) and provide relatively good performance.

I did some tests tracing memory accesses of SPEC benchmarks in x86-64, and QEMU
is consistently faster than PIN in most cases. Even better, it works for any
guest architecture and for both apps and full systems.

This speed comes at the cost of exposing TCG operations to the instrumentation
library (i.e., the library can inject TCG code; AFAIR, calling out into a
function in the instrumentation library is slower than PIN). I have a separate
project that translates a higher-level language into the TCG instrumentation
primitives (providing something like PIN's instrumentation auto-inlining), but I
think that's a completely separate discussion.

If there is such renewed interest, I will carve out a bit more time to bring
the patches up to date and send the instrumentation ones for further
discussion.


Cheers,
  Lluis


* Re: [Qemu-devel] GSoC 2017 Proposal: TCG performance enhancements
  2017-06-07 13:35       ` Paolo Bonzini
@ 2017-06-07 15:52         ` Lluís Vilanova
  2017-06-07 16:09           ` Alex Bennée
  2017-06-07 17:07           ` Paolo Bonzini
  0 siblings, 2 replies; 23+ messages in thread
From: Lluís Vilanova @ 2017-06-07 15:52 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Peter Maydell, Emilio G. Cota, Pranith Kumar, Alex Bennée,
	qemu-devel, Richard Henderson, Alessandro Di Federico

Paolo Bonzini writes:

> On 07/06/2017 14:07, Peter Maydell wrote:
>>> My understanding was that adding a public instrumentation interface would add
>>> too much code maintenance overhead for a feature that is not in QEMU's core
>>> target.
>> Well, it depends what you define as our core target :-)
>> I think we get quite a lot of users that want some useful ability
>> to see what their guest code is doing, and these days (when
>> dev board hardware is often very cheap and easily available)

> and virtualization is too...

Actually, in this case I was thinking of some way to transition between KVM and
TCG back and forth to be able to instrument a VM at any point in time.


>> I think that's a lot of the value that emulation can bring to
>> the table. Obviously we would want to try to do it in a way
>> that is low-runtime-overhead and is easy to get right for
>> people adding/maintaining cpu target frontend code...

> Indeed.  I even sometimes use TCG -d in_asm,exec,int for KVM unit tests,
> because it's easier to debug them that way :) so introspection ability
> is welcome.

AFAIR, Blue Swirl once proposed to use the instrumentation features to implement
unit tests.


> Related to this is also Alessandro's work to librarify TCG (he has a
> TCG-> LLVM backend for example).

Maybe I misunderstood, but that would be completely orthogonal, even though
instrumentation performance might benefit from LLVM's advanced IR
optimizers. But this goes a long way to hot code identification and asynchronous
optimization (since code that is not really hot will just run faster with
simpler optimizations, like in the TCG compiler). This actually sounds pretty
much like Java's HotSpot, certainly a non-trivial effort.


Cheers,
  Lluis


* Re: [Qemu-devel] GSoC 2017 Proposal: TCG performance enhancements
  2017-06-07 15:52         ` Lluís Vilanova
@ 2017-06-07 16:09           ` Alex Bennée
  2017-06-07 17:07           ` Paolo Bonzini
  1 sibling, 0 replies; 23+ messages in thread
From: Alex Bennée @ 2017-06-07 16:09 UTC (permalink / raw)
  To: Lluís Vilanova
  Cc: Paolo Bonzini, Peter Maydell, Emilio G. Cota, Pranith Kumar,
	qemu-devel, Richard Henderson, Alessandro Di Federico


Lluís Vilanova <vilanova@ac.upc.edu> writes:

> Paolo Bonzini writes:
>
>> On 07/06/2017 14:07, Peter Maydell wrote:
>>>> My understanding was that adding a public instrumentation interface would add
>>>> too much code maintenance overhead for a feature that is not in QEMU's core
>>>> target.
>>> Well, it depends what you define as our core target :-)
>>> I think we get quite a lot of users that want some useful ability
>>> to see what their guest code is doing, and these days (when
>>> dev board hardware is often very cheap and easily available)
>
>> and virtualization is too...
>
> Actually, in this case I was thinking of some way to transition between KVM and
> TCG back and forth to be able to instrument a VM at any point in time.

While we are blue-sky thinking, another fun thing might be doing system
emulation without SoftMMU, instead using the host's virtualized page
tables (i.e. running TCG code inside KVM). Obviously there are mapping
issues given differing page sizes and the like, but it would save the
SoftMMU overhead.
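
The page-size mismatch is the first thing you hit: a guest page can only be
backed directly by host PTEs if its size is a multiple of the host page
size. A toy check (not QEMU code, the sizes are just examples):

    /* A guest page maps 1:1 onto whole host pages only if the guest page
     * size is a multiple of the host page size. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    static bool direct_mappable(size_t guest_page, size_t host_page)
    {
        return guest_page >= host_page && guest_page % host_page == 0;
    }

    int main(void)
    {
        /* 4K x86 guest pages on a 64K-page ARM64 host: no direct mapping */
        printf("4K guest on 64K host: %d\n", direct_mappable(4096, 65536));
        /* 64K guest pages on a 4K host: one guest page = 16 host pages */
        printf("64K guest on 4K host: %d\n", direct_mappable(65536, 4096));
        return 0;
    }

So something like a 4K-page x86 guest on a 64K-page arm64 host would still
need a SoftMMU-style fallback for the cases that cannot be mapped directly.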
>
>
>>> I think that's a lot of the value that emulation can bring to
>>> the table. Obviously we would want to try to do it in a way
>>> that is low-runtime-overhead and is easy to get right for
>>> people adding/maintaining cpu target frontend code...
>
>> Indeed.  I even sometimes use TCG -d in_asm,exec,int for KVM unit tests,
>> because it's easier to debug them that way :) so introspection ability
>> is welcome.
>
> AFAIR, Blue Swirl once proposed to use the instrumentation features to implement
> unit tests.
>
>
>> Related to this is also Alessandro's work to librarify TCG (he has a
>> TCG->LLVM backend for example).
>
> Maybe I misunderstood, but that would be completely orthogonal, even though
> instrumentation performance might benefit from LLVM's advanced IR
> optimizers. But this goes a long way toward hot-code identification and
> asynchronous optimization (since code that is not really hot will just run
> faster with simpler optimizations, as in the TCG compiler). This actually
> sounds pretty much like Java's HotSpot, and is certainly a non-trivial effort.

--
Alex Bennée

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] GSoC 2017 Proposal: TCG performance enhancements
  2017-06-07 15:45       ` Lluís Vilanova
@ 2017-06-07 16:17         ` Peter Maydell
  2017-06-07 22:49         ` Emilio G. Cota
  1 sibling, 0 replies; 23+ messages in thread
From: Peter Maydell @ 2017-06-07 16:17 UTC (permalink / raw)
  To: Peter Maydell, Emilio G. Cota, Pranith Kumar, Paolo Bonzini,
	Alex Bennée, qemu-devel, Richard Henderson

On 7 June 2017 at 16:45, Lluís Vilanova <vilanova@ac.upc.edu> wrote:
> This speed comes at the cost of exposing TCG operations to the instrumentation
> library (i.e., the library can inject TCG code; AFAIR, calling out into a
> function in the instrumentation library is slower than PIN).

Mmm, that's awkward. I'm not sure I'd really like to allow arbitrary
user instrumentation to inject TCG code: it exposes our rather
changeable internals to the user, and it's a more complicated
interface to understand. For a user-facing API (as opposed to
instrumentation interfaces within QEMU which we use to implement
something simpler to present to the user) I would favour
a clean and straightforward API over pure speed.
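
By clean and straightforward I mean something at roughly this level of
abstraction; this is only a sketch of the shape and every name in it is
made up:

    /* Sketch of a callback-level instrumentation surface: the user never
     * sees TCG ops; all names here are hypothetical. */
    #include <stddef.h>
    #include <stdint.h>

    typedef void (*insn_exec_cb)(uint64_t vaddr, void *userdata);
    typedef void (*mem_access_cb)(uint64_t vaddr, size_t size, int is_write,
                                  void *userdata);

    struct instr_hooks {
        insn_exec_cb  on_insn_exec;     /* per executed guest instruction */
        mem_access_cb on_mem_access;    /* per guest load/store */
        void *userdata;
    };

    /* The emulator calls the hooks; the plugin registers them once. */
    void instr_register(const struct instr_hooks *hooks);

The price is a call out to the library per event, as Lluís says, but it
keeps the TCG internals free to change underneath.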

thanks
-- PMM

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] GSoC 2017 Proposal: TCG performance enhancements
  2017-06-07 15:52         ` Lluís Vilanova
  2017-06-07 16:09           ` Alex Bennée
@ 2017-06-07 17:07           ` Paolo Bonzini
  1 sibling, 0 replies; 23+ messages in thread
From: Paolo Bonzini @ 2017-06-07 17:07 UTC (permalink / raw)
  To: Peter Maydell, Emilio G. Cota, Pranith Kumar, Alex Bennée,
	qemu-devel, Richard Henderson, Alessandro Di Federico



On 07/06/2017 17:52, Lluís Vilanova wrote:
> Paolo Bonzini writes:
> 
>> On 07/06/2017 14:07, Peter Maydell wrote:
>>>> My understanding was that adding a public instrumentation interface would add
>>>> too much code maintenance overhead for a feature that is not in QEMU's core
>>>> target.
>>> Well, it depends what you define as our core target :-)
>>> I think we get quite a lot of users that want some useful ability
>>> to see what their guest code is doing, and these days (when
>>> dev board hardware is often very cheap and easily available)
> 
>> and virtualization is too...
> 
> Actually, in this case I was thinking of some way to transition between KVM and
> TCG back and forth to be able to instrument a VM at any point in time.

That's not really easy because KVM exposes different hardware (on x86:
kvmclock, hypercalls, x2apic, more MSRs).  But we are digressing.
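
It's also easy to see the difference from inside the guest: KVM puts its
"KVMKVMKVM" signature in CPUID leaf 0x40000000, and you won't read that
back under TCG. A minimal check from an x86 guest:

    /* Read the hypervisor signature leaf; KVM returns "KVMKVMKVM\0\0\0"
     * in EBX/ECX/EDX, so a guest can tell KVM and TCG apart. */
    #include <cpuid.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;
        char sig[13] = { 0 };

        __cpuid(0x40000000, eax, ebx, ecx, edx);
        (void)eax;                      /* max hypervisor leaf, unused here */
        memcpy(sig + 0, &ebx, 4);
        memcpy(sig + 4, &ecx, 4);
        memcpy(sig + 8, &edx, 4);

        printf("hypervisor signature: \"%s\"\n", sig);
        printf("KVM: %s\n", strcmp(sig, "KVMKVMKVM") == 0 ? "yes" : "no");
        return 0;
    }

Which is exactly why flipping a running VM between the two accelerators is
not transparent to the guest.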

>> Related to this is also Alessandro's work to librarify TCG (he has a
>> TCG->LLVM backend for example).
> 
> Maybe I misunderstood, but that would be completely orthogonal, even though
> instrumentation performance might benefit from LLVM's advanced IR
> optimizers.

It is different, but it shows the interest in bringing QEMU's
translation engine (the front-end in Alessandro's case, the back-end in
yours) beyond the simple use case of dynamic recompilation.

Paolo

> But this goes a long way toward hot-code identification and asynchronous
> optimization (since code that is not really hot will just run faster with
> simpler optimizations, as in the TCG compiler). This actually sounds pretty
> much like Java's HotSpot, and is certainly a non-trivial effort.
> 
> 
> Cheers,
>   Lluis
> 

^ permalink raw reply	[flat|nested] 23+ messages in thread

* Re: [Qemu-devel] GSoC 2017 Proposal: TCG performance enhancements
  2017-06-07 15:45       ` Lluís Vilanova
  2017-06-07 16:17         ` Peter Maydell
@ 2017-06-07 22:49         ` Emilio G. Cota
  1 sibling, 0 replies; 23+ messages in thread
From: Emilio G. Cota @ 2017-06-07 22:49 UTC (permalink / raw)
  To: Lluís Vilanova
  Cc: Peter Maydell, Pranith Kumar, Paolo Bonzini, Alex Bennée,
	qemu-devel, Richard Henderson

On Wed, Jun 07, 2017 at 18:45:10 +0300, Lluís Vilanova wrote:
> If there is such renewed interest, I will carve a bit more time to bring the
> patches up to date and send the instrumentation ones for further discussion.

I'm very interested and have time to spend on it -- I'm working on a
simulator backend and would like to move ASAP from QSim[1] to qemu proper for
the front-end. BTW I left some comments/questions a few days ago on the
v7 patchset you sent in January (ouch!).

I can also help with testing or bringing patches up to date -- let me
know if you need any help.

Thanks,

		Emilio

[1] http://manifold.gatech.edu/projects/qsim-a-multicore-emulator/

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2017-06-07 22:49 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-25 16:52 [Qemu-devel] GSoC 2017 Proposal: TCG performance enhancements Pranith Kumar
2017-03-27 10:57 ` Richard Henderson
2017-03-27 13:22   ` Alex Bennée
2017-03-28  3:03   ` Pranith Kumar
2017-03-28  3:09     ` Pranith Kumar
2017-03-28 10:03       ` Stefan Hajnoczi
2017-06-02 23:39   ` [Qemu-devel] [PATCH] tcg: allocate TB structs before the corresponding translated code Emilio G. Cota
2017-06-04 17:47     ` Richard Henderson
2017-03-27 11:32 ` [Qemu-devel] GSoC 2017 Proposal: TCG performance enhancements Paolo Bonzini
2017-03-28  3:07   ` Pranith Kumar
2017-03-27 15:54 ` Stefan Hajnoczi
2017-03-27 17:13   ` Pranith Kumar
2017-06-06 17:13 ` Emilio G. Cota
2017-06-07 10:15   ` Alex Bennée
2017-06-07 11:12   ` Lluís Vilanova
2017-06-07 12:07     ` Peter Maydell
2017-06-07 13:35       ` Paolo Bonzini
2017-06-07 15:52         ` Lluís Vilanova
2017-06-07 16:09           ` Alex Bennée
2017-06-07 17:07           ` Paolo Bonzini
2017-06-07 15:45       ` Lluís Vilanova
2017-06-07 16:17         ` Peter Maydell
2017-06-07 22:49         ` Emilio G. Cota
