* Re: [Qemu-devel] ideas for improving TLB performance (help with TCG backend wanted)
From: Emilio G. Cota @ 2018-10-01 18:34 UTC (permalink / raw)
  To: Alex Bennée; +Cc: qemu-devel, Pranith Kumar, Richard Henderson

On Thu, Sep 20, 2018 at 01:19:51 +0100, Alex Bennée wrote:
> If we are going to have an indirection, then we can also drop the
> requirement to scale the TLB according to the number of MMU indexes we
> have to support. It's fairly wasteful when a bunch of them are almost
> never used in practice.

So with dynamic TLB sizing, what you're suggesting here is to resize
each MMU array independently (depending on its use rate) instead
of using a single "TLB size" for all MMU indexes. Am I understanding
your point correctly?

Thanks,

		E.


* Re: [Qemu-devel] ideas for improving TLB performance (help with TCG backend wanted)
From: Richard Henderson @ 2018-10-01 20:40 UTC (permalink / raw)
  To: Emilio G. Cota, Alex Bennée; +Cc: qemu-devel, Pranith Kumar

On 10/1/18 1:34 PM, Emilio G. Cota wrote:
> On Thu, Sep 20, 2018 at 01:19:51 +0100, Alex Bennée wrote:
>> If we are going to have an indirection, then we can also drop the
>> requirement to scale the TLB according to the number of MMU indexes we
>> have to support. It's fairly wasteful when a bunch of them are almost
>> never used in practice.
> 
> So with dynamic TLB sizing, what you're suggesting here is to resize
> each MMU array independently (depending on its use rate) instead
> of using a single "TLB size" for all MMU indexes. Am I understanding
> your point correctly?

You cannot do that without flushing the TBs (and with out-of-line memory ops,
the prologue as well) and regenerating.  The TLB size is baked into the code.
And we really don't have any extra registers free to vary that.
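
For reference, the fast path each backend emits boils down to the
sketch below (simplified C, not the literal generated code):

    /* CPU_TLB_SIZE is a compile-time constant, so the mask below ends
     * up as an immediate in the emitted host code; changing the size
     * therefore invalidates every translated block. */
    static inline CPUTLBEntry *tlb_entry_sketch(CPUArchState *env,
                                                target_ulong addr,
                                                int mmu_idx)
    {
        uintptr_t index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
        return &env->tlb_table[mmu_idx][index];
    }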


r~


* Re: [Qemu-devel] ideas for improving TLB performance (help with TCG backend wanted)
From: Emilio G. Cota @ 2018-10-02  1:54 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Alex Bennée, qemu-devel, Pranith Kumar

On Mon, Oct 01, 2018 at 15:40:37 -0500, Richard Henderson wrote:
> On 10/1/18 1:34 PM, Emilio G. Cota wrote:
> > On Thu, Sep 20, 2018 at 01:19:51 +0100, Alex Bennée wrote:
> >> If we are going to have an indirection, then we can also drop the
> >> requirement to scale the TLB according to the number of MMU indexes we
> >> have to support. It's fairly wasteful when a bunch of them are almost
> >> never used in practice.
> > 
> > So with dynamic TLB sizing, what you're suggesting here is to resize
> > each MMU array independently (depending on its use rate) instead
> > of using a single "TLB size" for all MMU indexes. Am I understanding
> > your point correctly?
> 
> You cannot do that without flushing the TBs (and with out-of-line memory ops,
> the prologue as well) and regenerating.  The TLB size is baked into the code.
> And we really don't have any extra registers free to vary that.

Can you please elaborate on this? I can't see where this is
baked into the generated code, other than the TLB lookup.
Grepping for CPU_TLB_SIZE and CPU_TLB_BITS only shows a few
places.

Today I wrote a prototype of dynamic TLB resizing. It uses no
extra registers, because mmu_idx is known at generation time.
I haven't done any extensive testing yet, but at least it boots
aarch64 and x86_64 guests on an x86_64 host.
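
The gist is roughly the sketch below -- the field names are
hypothetical, see the branch for what it actually does:

    /* One table pointer and one mask per mmu_idx.  Since mmu_idx is a
     * constant at code-generation time, both fields sit at fixed
     * offsets from env; only the mask itself is loaded at run time. */
    typedef struct CPUTLBDesc {
        CPUTLBEntry *table;   /* current table for this mmu_idx */
        uintptr_t mask;       /* number of entries - 1 */
    } CPUTLBDesc;

    static inline CPUTLBEntry *tlb_entry_dyn(CPUArchState *env,
                                             target_ulong addr,
                                             int mmu_idx)
    {
        CPUTLBDesc *desc = &env->tlb_desc[mmu_idx]; /* hypothetical */
        uintptr_t index = (addr >> TARGET_PAGE_BITS) & desc->mask;
        return &desc->table[index];
    }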

The code (some messy WIP commits in there, sorry) is at:
  https://github.com/cota/qemu/tree/tlb2

Please take a look -- am I doing anything horribly wrong there?

Thanks,

		Emilio


* Re: [Qemu-devel] ideas for improving TLB performance (help with TCG backend wanted)
From: Alex Bennée @ 2018-10-02  6:48 UTC (permalink / raw)
  To: Emilio G. Cota; +Cc: qemu-devel, Pranith Kumar, Richard Henderson


Emilio G. Cota <cota@braap.org> writes:

> On Thu, Sep 20, 2018 at 01:19:51 +0100, Alex Bennée wrote:
>> If we are going to have an indirection, then we can also drop the
>> requirement to scale the TLB according to the number of MMU indexes we
>> have to support. It's fairly wasteful when a bunch of them are almost
>> never used in practice.
>
> So with dynamic TLB sizing, what you're suggesting here is to resize
> each MMU array independently (depending on its use rate) instead
> of using a single "TLB size" for all MMU indexes. Am I understanding
> your point correctly?

Not quite - I think it would overly complicate the lookup to have a
differently sized TLB for each mmu index - even if their usage
patterns are different.

I just meant that if we already have the cost of an indirection, we
don't have to have:

    CPUTLBEntry tlb_table[NB_MMU_MODES][CPU_TLB_SIZE];
    CPUIOTLBEntry iotlb[NB_MMU_MODES][CPU_TLB_SIZE];

restrict their sizes so that any entry in the 2D array can be indexed
directly from env. Currently CPU_TLB_SIZE/CPU_TLB_BITS is restricted by
the number of NB_MMU_MODES we have to support. But if each table can be
flushed and managed separately we can have:

    CPUTLBEntry *tlb_table[NB_MMU_MODES];

And we can size CPU_TLB_SIZE for the maximum offset we can manage in
the lookup code. This is mainly driven by the varying
TCG_TARGET_TLB_DISPLACEMENT_BITS each backend has available to it.
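
To spell the constraint out (a sketch of the address arithmetic, not
actual backend code):

    /* With the 2D array, any entry is reachable with a single
     * base+displacement off env: */
    CPUTLBEntry *entry = (CPUTLBEntry *)((char *)env
        + offsetof(CPUArchState, tlb_table)
        + mmu_idx * CPU_TLB_SIZE * sizeof(CPUTLBEntry)
        + index * sizeof(CPUTLBEntry));
    /* The whole displacement must fit in the host addressing mode,
     * which is what TCG_TARGET_TLB_DISPLACEMENT_BITS captures, so
     * NB_MMU_MODES * CPU_TLB_SIZE * sizeof(CPUTLBEntry) is bounded.
     * With a pointer per mmu_idx, only index * sizeof(CPUTLBEntry)
     * has to fit. */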

--
Alex Bennée


* Re: [Qemu-devel] ideas for improving TLB performance (help with TCG backend wanted)
From: Emilio G. Cota @ 2018-10-02 18:09 UTC (permalink / raw)
  To: Alex Bennée; +Cc: qemu-devel, Pranith Kumar, Richard Henderson

On Tue, Oct 02, 2018 at 07:48:20 +0100, Alex Bennée wrote:
> 
> Emilio G. Cota <cota@braap.org> writes:
> 
> > On Thu, Sep 20, 2018 at 01:19:51 +0100, Alex Bennée wrote:
> >> If we are going to have an indirection, then we can also drop the
> >> requirement to scale the TLB according to the number of MMU indexes we
> >> have to support. It's fairly wasteful when a bunch of them are almost
> >> never used in practice.
> >
> > So with dynamic TLB sizing, what you're suggesting here is to resize
> > each MMU array independently (depending on its use rate) instead
> > of using a single "TLB size" for all MMU indexes. Am I understanding
> > your point correctly?
> 
> Not quite - I think it would overly complicate the lookup to have a
> differently sized TLB for each mmu index - even if their usage
> patterns are different.

It just adds a load to get the mask, which will most likely hit in
the L1. The value is not used until 3 instructions later, by which
time the L1 read will have completed.
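
Concretely, the change to the fast path is just the following
(pseudo-C sketch; tlb_mask is a hypothetical field name):

    /* before: the mask is an immediate baked into the generated code */
    index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);

    /* after: one extra load from a fixed offset (mmu_idx is a
     * generation-time constant); an L1 hit hides the latency, since
     * the result is not consumed until the AND a few insns later */
    mask  = env->tlb_mask[mmu_idx];   /* hypothetical field */
    index = (addr >> TARGET_PAGE_BITS) & mask;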

> I just meant that if we already have the cost of an indirection, we
> don't have to have:
> 
>     CPUTLBEntry tlb_table[NB_MMU_MODES][CPU_TLB_SIZE];
>     CPUIOTLBEntry iotlb[NB_MMU_MODES][CPU_TLB_SIZE];
> 
> restrict their sizes so that any entry in the 2D array can be indexed
> directly from env. Currently CPU_TLB_SIZE/CPU_TLB_BITS is restricted by
> the number of NB_MMU_MODES we have to support. But if each table can be
> flushed and managed separately we can have:
> 
>     CPUTLBEntry *tlb_table[NB_MMU_MODES];
> 
> And we can size CPU_TLB_SIZE for the maximum offset we can manage in
> the lookup code. This is mainly driven by the varying
> TCG_TARGET_TLB_DISPLACEMENT_BITS each backend has available to it.

What I implemented is what you suggest, but with dynamic resizing based
on usage. I'm keeping the current CPU_TLB_SIZE as the minimum size, and
I took Pranith's TCG_TARGET_TLB_MAX_INDEX_BITS definitions (from 2017)
to limit the maximum TLB size per mmu index.
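
The policy is roughly the sketch below -- the thresholds and the
n_used_entries counter are made-up placeholders, not necessarily what
the branch does:

    /* Grow a per-mmu_idx table when its occupancy is high, shrink it
     * when low, clamped between the old fixed size and the
     * per-backend maximum.  Resizing implies flushing that table. */
    #define TLB_MIN_BITS  CPU_TLB_BITS                  /* floor */
    #define TLB_MAX_BITS  TCG_TARGET_TLB_MAX_INDEX_BITS /* ceiling */

    static void tlb_mmu_resize(CPUTLBDesc *desc)
    {
        size_t size = desc->mask + 1;
        double use_rate = (double)desc->n_used_entries / size;

        if (use_rate > 0.7 && size < ((size_t)1 << TLB_MAX_BITS)) {
            size <<= 1;               /* mostly full: double it */
        } else if (use_rate < 0.3 && size > ((size_t)1 << TLB_MIN_BITS)) {
            size >>= 1;               /* mostly empty: halve it */
        } else {
            return;
        }
        g_free(desc->table);
        desc->table = g_new0(CPUTLBEntry, size); /* starts flushed */
        desc->mask = size - 1;
        desc->n_used_entries = 0;
    }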

I'll prepare an RFC.

Thanks,

		Emilio

