* Re: Suggestions for TCG performance improvements
       [not found] <c76bde31-8f3b-2d03-b7c7-9e026d4b5873@huawei.com>
@ 2021-12-02 15:31 ` Alex Bennée
  2021-12-03 16:21   ` Vasilev Oleg via
  2021-12-03  5:21 ` Emilio Cota
  1 sibling, 1 reply; 7+ messages in thread
From: Alex Bennée @ 2021-12-02 15:31 UTC (permalink / raw)
  To: Vasilev Oleg
  Cc: peter.maydell, Konobeev Vladimir, Plotnik Nikolay,
	Richard Henderson, qemu-devel, Andrey Shinkevich, Emilio G. Cota,
	qemu-arm, Chengen (William, FixNet),
	Paolo Bonzini


Vasilev Oleg <vasilev.oleg@huawei.com> writes:

> Hi everyone,
>
> I've recently been tasked with improving QEMU performance and would like
> to discuss several possible optimizations which we could implement and
> later upstream.

Excellent - it's always good to see others that want to improve our
emulation performance ;-)

> We ran the sysbench[1] tool in threads mode on Linux installed as
> an aarch64 guest on an x86_64 host. The QEMU profile flamegraph is attached
> to this message. One of the conclusions is that refilling the TLB takes
> a substantial amount of time, and we are thinking of some solutions to
> avoid refilling the TLB so often.

Refilling the TLB is always going to be an expensive operation: besides
the complexity of the slow path, we also need to walk the guest page
tables to come up with the new contents.

> I've discovered some MMU-related suggestions in the 2018 letter[2], and
> those seem to be still not implemented (flush still uses memset[3]).
> Do you think we should go forward with implementing those?

I doubt you can do better than memset which should be the most optimised
memory clear for the platform. We could consider a separate thread to
proactively allocate and clear new TLBs so we don't have to do it at
flush time. However we wouldn't have complete information about what
size we want the new table to be.

When a TLB flush is performed it could be that the majority of the old
table is still perfectly valid. However we would need a reliable
mechanism to work out which entries in the table could be kept. I did
ponder a debug mode which would keep the last N tables dropped by
tlb_mmu_resize_locked and then measure the differences in the entries
before submitting the free to an RCU task.
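
To make the "clear ahead of time" idea slightly more concrete, here is a
rough standalone sketch (all names are made up, it is not the real
cputlb.c code, and it dodges the dynamic-resizing question entirely):

  /* Hypothetical sketch: keep a pre-zeroed spare table so the flush path
   * can swap pointers instead of clearing inline.  Not actual QEMU code. */
  #include <pthread.h>
  #include <stdint.h>
  #include <stdlib.h>

  #define TLB_ENTRIES 1024                 /* fixed size just for the sketch */

  typedef struct {
      uint64_t addr_read, addr_write, addr_code;
      uintptr_t addend;
  } TLBEntrySketch;

  static TLBEntrySketch *spare;            /* zeroed table waiting to be used */
  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t need_refill = PTHREAD_COND_INITIALIZER;

  /* Worker thread: re-arm the spare table whenever it has been consumed. */
  static void *tlb_prezero_worker(void *arg)
  {
      (void)arg;
      for (;;) {
          pthread_mutex_lock(&lock);
          while (spare) {
              pthread_cond_wait(&need_refill, &lock);
          }
          pthread_mutex_unlock(&lock);
          TLBEntrySketch *t = calloc(TLB_ENTRIES, sizeof(*t)); /* zeroed off the hot path */
          pthread_mutex_lock(&lock);
          spare = t;
          pthread_mutex_unlock(&lock);
      }
      return NULL;
  }

  /* Flush path: take the pre-zeroed table if ready, otherwise clear inline. */
  static TLBEntrySketch *tlb_flush_swap(TLBEntrySketch *current)
  {
      TLBEntrySketch *fresh;
      pthread_mutex_lock(&lock);
      fresh = spare;
      spare = NULL;
      pthread_cond_signal(&need_refill);
      pthread_mutex_unlock(&lock);
      if (!fresh) {
          fresh = calloc(TLB_ENTRIES, sizeof(*fresh));  /* fallback: as today */
      }
      free(current);   /* the real thing would defer this via RCU */
      return fresh;
  }

Whether the locking and extra memory traffic would actually beat an
inline memset is exactly the kind of thing that would need hard numbers.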

> The mentioned paper[4] also describes other possible improvements.
> Some of those are already implemented (such as victim TLB and dynamic
> size for TLB), but others are not (e.g. TLB lookup uninlining and
> set-associative TLB layer). Do you think those improvements
> are worth trying?

Anything is worth trying but you would need hard numbers. Also it's all
too easy to target micro benchmarks which might not show much difference
in real world use. The best thing you can do at the moment is give the
guest plenty of RAM so page updates are limited because the guest OS
doesn't have to swap RAM around.

Another optimisation would be looking at bigger page sizes. For example
the kernel (in a Linux setup) usually has a contiguous flat map for
kernel space. If we could represent that at a larger granularity then
not only could we make the page lookup tighter for kernel mode we could
also achieve things like cross-page TB chaining for kernel functions.
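
Purely as a sketch with made-up types (the real CPUTLBEntry layout and
fast path are different), a per-entry granule mask is roughly all the
hit test itself would need:

  /* Hypothetical variable-granule TLB entry; not the real CPUTLBEntry. */
  #include <stdbool.h>
  #include <stdint.h>

  typedef struct {
      uint64_t  vaddr_tag;    /* start of the mapped region                     */
      uint64_t  granule_mask; /* ~0xfffULL for 4KiB, ~0x1fffffULL for 2MiB, ... */
      uintptr_t addend;       /* host minus guest offset, as in the real TLB    */
  } BigTLBEntrySketch;

  /* Same shape as today's hit test, but with a per-entry mask instead of a
   * global page mask, so one entry can cover a whole kernel block. */
  static inline bool big_tlb_hit(const BigTLBEntrySketch *e, uint64_t vaddr)
  {
      return (vaddr & e->granule_mask) == e->vaddr_tag;
  }

  static inline void *big_tlb_to_host(const BigTLBEntrySketch *e, uint64_t vaddr)
  {
      return (void *)(uintptr_t)(vaddr + e->addend);
  }

The cost is that the fast path can no longer rely on a single
compile-time page mask, so this would need careful benchmarking.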

> Another idea for decreasing the occurrence of TLB refills is to make the TB key
> in the htable independent of the physical address. I assume it is only needed
> to distinguish different processes where VAs can be the same.
> Is that assumption correct?
>
> Do you have any other ideas which parts of TCG could require our
> attention w.r.t the flamegraph I attached?

It's been done before, though not via upstream patches: improving code
generation for hot loops would be a potential performance win. That
would require some changes to the translation model to allow for
multiple exit points and probably introducing a new code generator
(gccjit or llvm) to generate highly optimised code.
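
As a purely illustrative sketch of how such a tiered setup could be
driven (nothing like this exists in QEMU today and all the names are
invented), you could count executions per TB and queue hot regions for
the heavier optimiser:

  /* Hypothetical hot-TB detection; not an existing QEMU mechanism. */
  #include <stdatomic.h>
  #include <stdbool.h>
  #include <stdint.h>

  #define HOT_THRESHOLD 10000

  typedef struct {
      uint64_t pc;                        /* guest PC of the region */
      atomic_uint_fast64_t exec_count;
      atomic_bool queued;
  } HotTBSketch;

  /* Imagined to be called from an instrumentation stub in the TB prologue. */
  static inline void tb_note_exec(HotTBSketch *tb)
  {
      uint64_t n = atomic_fetch_add_explicit(&tb->exec_count, 1,
                                             memory_order_relaxed) + 1;
      if (n >= HOT_THRESHOLD &&
          !atomic_exchange_explicit(&tb->queued, true, memory_order_relaxed)) {
          /* queue the region for re-translation by llvm/gccjit (hypothetical) */
      }
  }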

> I am also CCing my teammates. We are eager to improve the QEMU TCG
> performance for our needs and to contribute our patches to upstream.

Do you have any particular goal in mind or just "better"? The current
MTTCG scaling tends to drop off as we go above 10-12 vCPUs due to the
cost of synchronous flushing across all those vCPUs.

>
> [1]: https://github.com/akopytov/sysbench
> [2]: https://www.mail-archive.com/qemu-devel@nongnu.org/msg562103.html
> [3]: 
> https://github.com/qemu/qemu/blob/14d02cfbe4adaeebe7cb833a8cc71191352cf03b/accel/tcg/cputlb.c#L239
> [4]: https://dl.acm.org/doi/pdf/10.1145/2686034
>
> [2. flamegraph.svg --- image/svg+xml; flamegraph.svg]...
>
> [3. callgraph.svg --- image/svg+xml; callgraph.svg]...


-- 
Alex Bennée



* Re: Suggestions for TCG performance improvements
       [not found] <c76bde31-8f3b-2d03-b7c7-9e026d4b5873@huawei.com>
  2021-12-02 15:31 ` Suggestions for TCG performance improvements Alex Bennée
@ 2021-12-03  5:21 ` Emilio Cota
  2021-12-03  6:30   ` Richard Henderson
  1 sibling, 1 reply; 7+ messages in thread
From: Emilio Cota @ 2021-12-03  5:21 UTC (permalink / raw)
  To: Vasilev Oleg
  Cc: peter.maydell, Konobeev Vladimir, Chengen (William, FixNet),
	Richard Henderson, qemu-devel, Andrey Shinkevich, qemu-arm,
	Plotnik Nikolay, Paolo Bonzini, Alex Bennée

On Thu, Dec 2, 2021 at 4:47 AM Vasilev Oleg <vasilev.oleg@huawei.com> wrote:
> The mentioned paper[4] also describes other possible improvements.
> Some of those are already implemented (such as victim TLB and dynamic
> size for TLB), but others are not (e.g. TLB lookup uninlining and
> set-associative TLB layer). Do you think those improvements
> are worth trying?

I cannot find the emails, but I do remember that Richard wrote tcg-i386 patches
for uninlining TLB lookups. Unfortunately they resulted in a slowdown on
modern machines.

        Emilio



* Re: Suggestions for TCG performance improvements
  2021-12-03  5:21 ` Emilio Cota
@ 2021-12-03  6:30   ` Richard Henderson
  0 siblings, 0 replies; 7+ messages in thread
From: Richard Henderson @ 2021-12-03  6:30 UTC (permalink / raw)
  To: Emilio Cota, Vasilev Oleg
  Cc: peter.maydell, Konobeev Vladimir, Chengen (William, FixNet),
	qemu-devel, Andrey Shinkevich, qemu-arm, Plotnik Nikolay,
	Paolo Bonzini, Alex Bennée

On 12/2/21 9:21 PM, Emilio Cota wrote:
> On Thu, Dec 2, 2021 at 4:47 AM Vasilev Oleg <vasilev.oleg@huawei.com> wrote:
>> The mentioned paper[4] also describes other possible improvements.
>> Some of those are already implemented (such as victim TLB and dynamic
>> size for TLB), but others are not (e.g. TLB lookup uninlining and
>> set-associative TLB layer). Do you think those improvements
>> are worth trying?
> 
> I cannot find the emails, but I do remember that Richard wrote tcg-i386 patches
> for uninlining TLB lookups. Unfortunately they resulted in a slowdown on
> modern machines.

That code is still around at
https://github.com/rth7680/qemu/tree/tcg-softmmu-ool


r~



* Re: Suggestions for TCG performance improvements
  2021-12-02 15:31 ` Suggestions for TCG performance improvements Alex Bennée
@ 2021-12-03 16:21   ` Vasilev Oleg via
  2021-12-03 17:27     ` Alex Bennée
  0 siblings, 1 reply; 7+ messages in thread
From: Vasilev Oleg via @ 2021-12-03 16:21 UTC (permalink / raw)
  To: Alex Bennée
  Cc: qemu-devel, Richard Henderson, Paolo Bonzini, Emilio G. Cota,
	peter.maydell, qemu-arm, Plotnik Nikolay, Andrey Shinkevich,
	Konobeev Vladimir, Chengen (William, FixNet)

On 12/2/2021 7:02 PM, Alex Bennée wrote:

> Vasilev Oleg <vasilev.oleg@huawei.com> writes:
>
>> I've discovered some MMU-related suggestions in the 2018 letter[2], and
>> those seem to be still not implemented (flush still uses memset[3]).
>> Do you think we should go forward with implementing those?
> I doubt you can do better than memset which should be the most optimised
> memory clear for the platform. We could consider a separate thread to
> proactively allocate and clear new TLBs so we don't have to do it at
> flush time. However we wouldn't have complete information about what
> size we want the new table to be.
>
> When a TLB flush is performed it could be that the majority of the old
> table is still perfectly valid. 

In that case, do you think it would be possible, instead of flushing the TLB, to store it somewhere and bring it back when the address space changes back?

> However we would need a reliable mechanism to work out which entries in the table could be kept. 

We could invalidate entries in those stored TLBs the same way we invalidate the active TLB. If we are going to have a new thread to manage TLB allocation, invalidation could also be offloaded to it.
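
Roughly what I have in mind, as a standalone sketch with made-up names
(the real code would have to hook into cputlb and deal with resizing
and cross-vCPU invalidation properly):

  /* Hypothetical sketch: park a flushed TLB per address space instead of
   * clearing it, and bring it back when that ASID is scheduled again. */
  #include <stdint.h>
  #include <stdlib.h>

  #define PARKED_SLOTS 8
  #define ENTRY_SIZE   32          /* assumed per-entry size for the sketch */

  typedef struct {
      uint64_t asid;               /* e.g. the ASID taken from TTBR0_EL1 */
      void    *table;
      size_t   n_entries;
  } ParkedTLBSketch;

  static ParkedTLBSketch parked[PARKED_SLOTS];

  /* On a context switch, stash the outgoing table instead of memset()ing it. */
  static void tlb_park(uint64_t asid, void *table, size_t n_entries)
  {
      ParkedTLBSketch *slot = &parked[asid % PARKED_SLOTS];
      free(slot->table);           /* evict the previous occupant of the slot */
      slot->asid = asid;
      slot->table = table;
      slot->n_entries = n_entries;
  }

  /* When the same address space returns, reuse its parked table if present.
   * Any TLB invalidation in between would also have to visit parked copies. */
  static void *tlb_unpark(uint64_t asid, size_t n_entries)
  {
      ParkedTLBSketch *slot = &parked[asid % PARKED_SLOTS];
      if (slot->table && slot->asid == asid && slot->n_entries == n_entries) {
          void *t = slot->table;
          slot->table = NULL;
          return t;
      }
      return calloc(n_entries, ENTRY_SIZE);   /* fall back to a clean table */
  }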

> I did ponder a debug mode which would keep the last N tables dropped by
> tlb_mmu_resize_locked and then measure the differences in the entries
> before submitting the free to an RCU task.
>> The mentioned paper[4] also describes other possible improvements.
>> Some of those are already implemented (such as victim TLB and dynamic
>> size for TLB), but others are not (e.g. TLB lookup uninlining and
>> set-associative TLB layer). Do you think those improvements
>> are worth trying?
> Anything is worth trying but you would need hard numbers. Also it's all
> too easy to target micro benchmarks which might not show much difference
> in real world use. 

The mentioned paper presents some benchmarking, e.g. Linux kernel compilation and some other stuff. Do you think those shouldn't be trusted?

> The best thing you can do at the moment is give the
> guest plenty of RAM so page updates are limited because the guest OS
> doesn't have to swap RAM around.
>
> Another optimisation would be looking at bigger page sizes. For example
> the kernel (in a Linux setup) usually has a contiguous flat map for
> kernel space. If we could represent that at a larger granularity then
> not only could we make the page lookup tighter for kernel mode we could
> also achieve things like cross-page TB chaining for kernel functions.

Do I understand correctly that currently softmmu doesn't treat hugepages specially, and you are suggesting we add such support, so that a particular region of memory occupies fewer TLB entries? This probably means the TLB lookup would become quite a bit more complex.

>> Another idea for decreasing the occurrence of TLB refills is to make the TB key
>> in the htable independent of the physical address. I assume it is only needed
>> to distinguish different processes where VAs can be the same.
>> Is that assumption correct?

This one, what do you think? Can we replace the physical-address part of the key in the TB htable with some sort of address space identifier?
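
Just to make the question concrete, something like this is what I mean
by the key (hypothetical struct and hash, not the existing tb_hash_func
code):

  /* Hypothetical TB lookup key: virtual PC plus an address-space id instead
   * of the physical PC.  Not the current QEMU hashing code. */
  #include <stdint.h>

  typedef struct {
      uint64_t pc;        /* guest virtual PC of the TB                    */
      uint64_t asid;      /* address-space identifier, e.g. from TTBR0_EL1 */
      uint32_t flags;     /* CPU state flags affecting translation         */
  } TBKeySketch;

  /* Crude mixing just for illustration; the real table uses proper hashing. */
  static inline uint32_t tb_key_hash(const TBKeySketch *k)
  {
      uint64_t h = k->pc * 0x9e3779b97f4a7c15ULL;
      h ^= k->asid + (h >> 29);
      h ^= k->flags;
      return (uint32_t)(h ^ (h >> 32));
  }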

>> Do you have any other ideas which parts of TCG could require our
>> attention w.r.t the flamegraph I attached?
> It's been done before, though not via upstream patches: improving code
> generation for hot loops would be a potential performance win. 

I am not sure optimizing the code generation itself would help much, at least in our case. The flamegraph I attached to the previous email shows that QEMU spends only about 10% of its time in generated code. The rest is helpers, searching for the next block, TLB-related stuff and so on.

> That would require some changes to the translation model to allow for
> multiple exit points and probably introducing a new code generator
> (gccjit or llvm) to generate highly optimised code.

This, however, could bring a lot of performance gain: translation blocks would become bigger, and we would spend less time searching for the next block.

>> I am also CCing my teammates. We are eager to improve the QEMU TCG
>> performance for our needs and to contribute our patches to upstream.
> Do you have any particular goal in mind or just "better"? The current
> MTTCG scaling tends to drop off as we go above 10-12 vCPUs due to the
> cost of synchronous flushing across all those vCPUs.

We have some internal ways to measure performance, but we are looking for an alternative metric that we could share and you could reproduce. Sysbench in threads mode is the closest we have found so far by comparing flamegraphs, but we are testing more benchmarking software.

>> [1]: https://github.com/akopytov/sysbench
>> [2]: https://www.mail-archive.com/qemu-devel@nongnu.org/msg562103.html
>> [3]: 
>> https://github.com/qemu/qemu/blob/14d02cfbe4adaeebe7cb833a8cc71191352cf03b/accel/tcg/cputlb.c#L239
>> [4]: https://dl.acm.org/doi/pdf/10.1145/2686034
>>
>> [2. flamegraph.svg --- image/svg+xml; flamegraph.svg]...
>>
>> [3. callgraph.svg --- image/svg+xml; callgraph.svg]...
>>
Thanks,
Oleg




* Re: Suggestions for TCG performance improvements
  2021-12-03 16:21   ` Vasilev Oleg via
@ 2021-12-03 17:27     ` Alex Bennée
  2021-12-06 19:40       ` Vasilev Oleg via
  0 siblings, 1 reply; 7+ messages in thread
From: Alex Bennée @ 2021-12-03 17:27 UTC (permalink / raw)
  To: Vasilev Oleg
  Cc: peter.maydell, Konobeev Vladimir, Plotnik Nikolay,
	Richard Henderson, qemu-devel, Andrey Shinkevich, Emilio G. Cota,
	qemu-arm, Chengen (William, FixNet),
	Paolo Bonzini


Vasilev Oleg <vasilev.oleg@huawei.com> writes:

> On 12/2/2021 7:02 PM, Alex Bennée wrote:
>
>> Vasilev Oleg <vasilev.oleg@huawei.com> writes:
>>
>>> I've discovered some MMU-related suggestions in the 2018 letter[2], and
>>> those seem to be still not implemented (flush still uses memset[3]).
>>> Do you think we should go forward with implementing those?
>> I doubt you can do better than memset which should be the most optimised
>> memory clear for the platform. We could consider a separate thread to
>> proactively allocate and clear new TLBs so we don't have to do it at
>> flush time. However we wouldn't have complete information about what
>> size we want the new table to be.
>>
>> When a TLB flush is performed it could be that the majority of the old
>> table is still perfectly valid. 
>
> In that case, do you think it would be possible, instead of flushing
> the TLB, to store it somewhere and bring it back when the address space
> changes back?

It would need a new interface into cputlb but I don't see why not.

>
>> However we would need a reliable mechanism to work out which entries in the table could be kept. 
>
> We could invalidate entries in those stored TLBs the same way we
> invalidate the active TLB. If we are going to have a new thread to
> manage TLB allocation, invalidation could also be offloaded to it.
>
>> I did ponder a debug mode which would keep the last N tables dropped by
>> tlb_mmu_resize_locked and then measure the differences in the entries
>> before submitting the free to an RCU task.
>>> The mentioned paper[4] also describes other possible improvements.
>>> Some of those are already implemented (such as victim TLB and dynamic
>>> size for TLB), but others are not (e.g. TLB lookup uninlining and
>>> set-associative TLB layer). Do you think those improvements
>>> are worth trying?
>> Anything is worth trying but you would need hard numbers. Also it's all
>> too easy to target micro benchmarks which might not show much difference
>> in real world use. 
>
> The mentioned paper presents some benchmarking, e.g. Linux kernel
> compilation and some other stuff. Do you think those shouldn't be
> trusted?

No, they are good. To be honest it's the context switches that get you.
Look at "info jit" between a normal distro and an initramfs shell. Places
where the kernel is switching between multiple maps mean a churn of TLB
data.

See my other post with a match of "msr ttbr".

>
>> The best thing you can do at the moment is give the
>> guest plenty of RAM so page updates are limited because the guest OS
>> doesn't have to swap RAM around.
>>
>> Another optimisation would be looking at bigger page sizes. For example
>> the kernel (in a Linux setup) usually has a contiguous flat map for
>> kernel space. If we could represent that at a larger granularity then
>> not only could we make the page lookup tighter for kernel mode we could
>> also achieve things like cross-page TB chaining for kernel functions.
>
> Do I understand correctly that currently softmmu doesn't treat
> hugepages specially, and you are suggesting we add such support, so
> that a particular region of memory occupies fewer TLB entries? This
> probably means the TLB lookup would become quite a bit more complex.
>
>>> Another idea for decreasing the occurrence of TLB refills is to make the TB key
>>> in the htable independent of the physical address. I assume it is only needed
>>> to distinguish different processes where VAs can be the same.
>>> Is that assumption correct?
>
> This one, what do you think? Can we replace the physical-address part
> of the key in the TB htable with some sort of address space identifier?

Hmm maybe - so a change in ASID wouldn't need a total flush?

>
>>> Do you have any other ideas which parts of TCG could require our
>>> attention w.r.t the flamegraph I attached?
>> It's been done before, though not via upstream patches: improving code
>> generation for hot loops would be a potential performance win. 
>
> I am not sure optimizing the code generation itself would help much,
> at least in our case. The flamegraph I attached to the previous email
> shows that QEMU spends only about 10% of its time in generated code.
> The rest is helpers, searching for the next block, TLB-related stuff
> and so on.
>
>> That would require some changes to the translation model to allow for
>> multiple exit points and probably introducing a new code generator
>> (gccjit or llvm) to generate highly optimised code.
>
> This, however, could bring a lot of performance gain: translation blocks would become bigger, and we would spend less time searching for the next block.
>
>>> I am also CCing my teammates. We are eager to improve the QEMU TCG
>>> performance for our needs and to contribute our patches to upstream.
>> Do you have any particular goal in mind or just "better"? The current
>> MTTCG scaling tends to drop off as we go above 10-12 vCPUs due to the
>> cost of synchronous flushing across all those vCPUs.
>
> We have some internal ways to measure performance, but we are looking
> for an alternative metric that we could share and you could reproduce.
> Sysbench in threads mode is the closest we have found so far by
> comparing flamegraphs, but we are testing more benchmarking software.

OK.

>
>>> [1]: https://github.com/akopytov/sysbench
>>> [2]: https://www.mail-archive.com/qemu-devel@nongnu.org/msg562103.html
>>> [3]: 
>>> https://github.com/qemu/qemu/blob/14d02cfbe4adaeebe7cb833a8cc71191352cf03b/accel/tcg/cputlb.c#L239
>>> [4]: https://dl.acm.org/doi/pdf/10.1145/2686034
>>>
>>> [2. flamegraph.svg --- image/svg+xml; flamegraph.svg]...
>>>
>>> [3. callgraph.svg --- image/svg+xml; callgraph.svg]...
>>>
> Thanks,
> Oleg


-- 
Alex Bennée



* Re: Suggestions for TCG performance improvements
  2021-12-03 17:27     ` Alex Bennée
@ 2021-12-06 19:40       ` Vasilev Oleg via
  2021-12-06 21:09         ` Alex Bennée
  0 siblings, 1 reply; 7+ messages in thread
From: Vasilev Oleg via @ 2021-12-06 19:40 UTC (permalink / raw)
  To: Alex Bennée
  Cc: peter.maydell, Konobeev Vladimir, Plotnik Nikolay,
	Richard Henderson, qemu-devel, Andrey Shinkevich, Emilio G. Cota,
	qemu-arm, Chengen (William, FixNet),
	Paolo Bonzini

On 12/3/2021 8:32 PM, Alex Bennée wrote:
> Vasilev Oleg <vasilev.oleg@huawei.com> writes:
>
>> On 12/2/2021 7:02 PM, Alex Bennée wrote:
>>
>>> Vasilev Oleg <vasilev.oleg@huawei.com> writes:
...skipped...
>>> I did ponder a debug mode which would keep the last N tables dropped by
>>> tlb_mmu_resize_locked and then measure the differences in the entries
>>> before submitting the free to an RCU task.
>>>> The mentioned paper[4] also describes other possible improvements.
>>>> Some of those are already implemented (such as victim TLB and dynamic
>>>> size for TLB), but others are not (e.g. TLB lookup uninlining and
>>>> set-associative TLB layer). Do you think those improvements
>>>> are worth trying?
>>> Anything is worth trying but you would need hard numbers. Also it's all
>>> too easy to target micro benchmarks which might not show much difference
>>> in real world use. 
>> The mentioned paper presents some benchmarking, e.g. Linux kernel
>> compilation and some other stuff. Do you think those shouldn't be
>> trusted?
> No, they are good. To be honest it's the context switches that get you.
> Look at "info jit" between a normal distro and an initramfs shell. Places
> where the kernel is switching between multiple maps mean a churn of TLB
> data.
>
> See my other post with a match of "msr ttbr".
Sorry, couldn't find what you are referring to. Could you please share
a link?
>>>> Another idea for decreasing the occurrence of TLB refills is to make the TB key
>>>> in the htable independent of the physical address. I assume it is only needed
>>>> to distinguish different processes where VAs can be the same.
>>>> Is that assumption correct?
>> This one, what do you think? Can we replace the physical-address part
>> of the key in the TB htable with some sort of address space identifier?
> Hmm maybe - so a change in ASID wouldn't need a total flush?

No, I think it would still need a flush since regular memory accesses need
to be in the correct address space. But we wouldn't need to access the TLB
when looking for the next TB. Also, the TLB wouldn't need to be filled with
code pages, only data pages.

Overall, thanks for your feedback on those ideas.

Oleg


...skipped...





* Re: Suggestions for TCG performance improvements
  2021-12-06 19:40       ` Vasilev Oleg via
@ 2021-12-06 21:09         ` Alex Bennée
  0 siblings, 0 replies; 7+ messages in thread
From: Alex Bennée @ 2021-12-06 21:09 UTC (permalink / raw)
  To: Vasilev Oleg
  Cc: peter.maydell, Konobeev Vladimir, Chengen (William, FixNet),
	Richard Henderson, qemu-devel, Andrey Shinkevich, Emilio G. Cota,
	Plotnik Nikolay, Paolo Bonzini, qemu-arm


Vasilev Oleg <vasilev.oleg@huawei.com> writes:

> On 12/3/2021 8:32 PM, Alex Bennée wrote:
>> Vasilev Oleg <vasilev.oleg@huawei.com> writes:
>>
>>> On 12/2/2021 7:02 PM, Alex Bennée wrote:
>>>
>>>> Vasilev Oleg <vasilev.oleg@huawei.com> writes:
> ...skipped...
>>>> I did ponder a debug mode which would keep the last N tables dropped by
>>>> tlb_mmu_resize_locked and then measure the differences in the entries
>>>> before submitting the free to an RCU task.
>>>>> The mentioned paper[4] also describes other possible improvements.
>>>>> Some of those are already implemented (such as victim TLB and dynamic
>>>>> size for TLB), but others are not (e.g. TLB lookup uninlining and
>>>>> set-associative TLB layer). Do you think those improvements
>>>>> are worth trying?
>>>> Anything is worth trying but you would need hard numbers. Also it's all
>>>> too easy to target micro benchmarks which might not show much difference
>>>> in real world use. 
>>> The mentioned paper presents some benchmarking, e.g. Linux kernel
>>> compilation and some other stuff. Do you think those shouldn't be
>>> trusted?
>> No, they are good. To be honest it's the context switches that get you.
>> Look at "info jit" between a normal distro and an initramfs shell. Places
>> where the kernel is switching between multiple maps mean a churn of TLB
>> data.
>>
>> See my other post with a match of "msr ttbr".
> Sorry, couldn't find what you are referring to. Could you please share
> a link?

It was an enhancement to the libinsns.so plugin to gauge how often
certain instructions are run:

  Subject: [RFC PATCH  0/2] insn plugin tweaks for measuring frequency
  Date: Fri,  3 Dec 2021 14:44:19 +0000
  Message-Id: <20211203144421.1445232-1-alex.bennee@linaro.org>

I think msr ttbr[10]_el1 is a key instruction because it triggers
a flush if the ASID changes. On my initramfs setup with a simple login
shell that doesn't happen; on a full distro there is context switching
all the time, which causes extra flushes.
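
For reference, a cut-down sketch of that kind of plugin against the
public plugin API (written from memory, so treat it as illustrative
rather than the actual patch; the match string and reporting are
simplified):

  /* Sketch of a plugin counting executions of instructions whose
   * disassembly matches a string.  Uses the public qemu-plugin API, but
   * this is not the patch referenced above. */
  #include <glib.h>
  #include <inttypes.h>
  #include <string.h>
  #include <qemu-plugin.h>

  QEMU_PLUGIN_EXPORT int qemu_plugin_version = QEMU_PLUGIN_VERSION;

  static const char *match = "msr ttbr";
  static uint64_t hits;                 /* racy across vCPUs; fine for a sketch */

  static void insn_exec(unsigned int cpu_index, void *udata)
  {
      hits++;
  }

  static void tb_trans(qemu_plugin_id_t id, struct qemu_plugin_tb *tb)
  {
      size_t n = qemu_plugin_tb_n_insns(tb);
      for (size_t i = 0; i < n; i++) {
          struct qemu_plugin_insn *insn = qemu_plugin_tb_get_insn(tb, i);
          char *disas = qemu_plugin_insn_disas(insn);
          if (disas && strstr(disas, match)) {
              qemu_plugin_register_vcpu_insn_exec_cb(insn, insn_exec,
                                                     QEMU_PLUGIN_CB_NO_REGS,
                                                     NULL);
          }
          g_free(disas);
      }
  }

  static void plugin_exit(qemu_plugin_id_t id, void *p)
  {
      g_autofree gchar *out = g_strdup_printf("matches for '%s': %" PRIu64 "\n",
                                              match, hits);
      qemu_plugin_outs(out);
  }

  QEMU_PLUGIN_EXPORT int qemu_plugin_install(qemu_plugin_id_t id,
                                             const qemu_info_t *info,
                                             int argc, char **argv)
  {
      qemu_plugin_register_vcpu_tb_trans_cb(id, tb_trans);
      qemu_plugin_register_atexit_cb(id, plugin_exit, NULL);
      return 0;
  }

Loaded with something like -plugin file=./libmatch.so, comparing the
reported count between an initramfs shell and a full distro boot makes
the difference in context-switch behaviour very visible.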

-- 
Alex Bennée


