All of lore.kernel.org
 help / color / mirror / Atom feed
* [Qemu-devel] GSoC 2017 Proposal: TCG performance enhancements
@ 2017-03-25 16:52 Pranith Kumar
  2017-03-27 10:57 ` Richard Henderson
                   ` (3 more replies)
  0 siblings, 4 replies; 23+ messages in thread
From: Pranith Kumar @ 2017-03-25 16:52 UTC (permalink / raw)
  To: Richard Henderson, Peter Maydell, Paolo Bonzini, Emilio G. Cota
  Cc: Alex Bennée, qemu-devel

Hello,

With MTTCG code now merged in mainline, I tried to see if we are able to run
x86 SMP guests on ARM64 hosts. For this I tried running a windows XP guest on
a dragonboard 410c which has 1GB RAM. Since x86 has a strong memory model
whereas ARM64 is a weak memory model, I added a patch to generate fence
instructions for every guest memory access. After some minor fixes, I was
successfully able to boot a 4 core guest all the way to the desktop (albeit
with a 1GB backing swap). However the performance is severely
limited and the guest is barely usable. Based on my observations, I think
there are some easily implementable additions we can make to improve the
performance of TCG in general and on ARM64 in particular. I propose to do the
following as part of Google Summer of Code 2017.


* Implement jump-to-register instruction on ARM64 to overcome the 128MB
  translation cache size limit.

  The translation cache size for an ARM64 host is currently limited to 128
  MB. This limitation is imposed by utilizing a branch instruction which
  encodes the jump offset and is limited by the number of bits it can use for
  the range of the offset. The performance impact by this limitation is severe
  and can be observed when you try to run large programs like a browser in the
  guest. The cache is flushed several times before the browser starts and the
  performance is not satisfactory. This limitation can be overcome by
  generating a branch-to-register instruction and utilizing that when the
  destination address is outside the range of what can be encoded in current
  branch instruction.

* Implement an LRU translation block code cache.

  In the current TCG design, when the translation cache fills up, we flush all
  the translated blocks (TBs) to free up space. We can improve this situation
  by not flushing the TBs that were recently used i.e., by implementing an LRU
  policy for freeing the blocks. This should avoid the re-translation overhead
  for frequently used blocks and improve performance.

* Avoid consistency overhead for strong memory model guests by generating
  load-acquire and store-release instructions.

  To run a strongly ordered guest on a weakly ordered host using MTTCG, for
  example, x86 on ARM64, we have to generate fence instructions for all the
  guest memory accesses to ensure consistency. The overhead imposed by these
  fence instructions is significant (almost 3x when compared to a run without
  fence instructions). ARM64 provides load-acquire and store-release
  instructions which are sequentially consistent and can be used instead of
  generating fence instructions. I plan to add support to generate these
  instructions in the TCG run-time to reduce the consistency overhead in
  MTTCG.

Alex Bennée, who mentored me last year, has agreed to mentor me again this
time if the proposal is accepted.

Please let me know if you have any comments or suggestions. Also please let me
know if there are other enhancements that are easily implementable to increase
TCG performance as part of this project or otherwise.

Thanks,
-- 
Pranith

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2017-06-07 22:49 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-03-25 16:52 [Qemu-devel] GSoC 2017 Proposal: TCG performance enhancements Pranith Kumar
2017-03-27 10:57 ` Richard Henderson
2017-03-27 13:22   ` Alex Bennée
2017-03-28  3:03   ` Pranith Kumar
2017-03-28  3:09     ` Pranith Kumar
2017-03-28 10:03       ` Stefan Hajnoczi
2017-06-02 23:39   ` [Qemu-devel] [PATCH] tcg: allocate TB structs before the corresponding translated code Emilio G. Cota
2017-06-04 17:47     ` Richard Henderson
2017-03-27 11:32 ` [Qemu-devel] GSoC 2017 Proposal: TCG performance enhancements Paolo Bonzini
2017-03-28  3:07   ` Pranith Kumar
2017-03-27 15:54 ` Stefan Hajnoczi
2017-03-27 17:13   ` Pranith Kumar
2017-06-06 17:13 ` Emilio G. Cota
2017-06-07 10:15   ` Alex Bennée
2017-06-07 11:12   ` Lluís Vilanova
2017-06-07 12:07     ` Peter Maydell
2017-06-07 13:35       ` Paolo Bonzini
2017-06-07 15:52         ` Lluís Vilanova
2017-06-07 16:09           ` Alex Bennée
2017-06-07 17:07           ` Paolo Bonzini
2017-06-07 15:45       ` Lluís Vilanova
2017-06-07 16:17         ` Peter Maydell
2017-06-07 22:49         ` Emilio G. Cota

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.