* [Qemu-devel] best way to implement emulation of AArch64 tagged addresses
@ 2016-04-04 15:51 Peter Maydell
  2016-04-04 16:28 ` Richard Henderson
  2016-04-04 16:35 ` Peter Maydell
  0 siblings, 2 replies; 12+ messages in thread
From: Peter Maydell @ 2016-04-04 15:51 UTC (permalink / raw)
  To: QEMU Developers; +Cc: Thomas Hanson, Richard Henderson

I was wondering about what the best way is to implement emulation in
TCG of the AArch64 tagged-addresses feature.
(cc'd Tom Hanson who's looking at actually writing code for this,
and RTH for review of the design sketch below.)

Quick summary of the feature (which is described in the v8 ARM ARM
section D4.1.1 "Address tagging in AArch64 state"):
If the 'tagged addresses' bit in TCR_EL1 is set then:
 * the top 8 bits of virtual addresses are ignored for doing va-to-pa
   translation (addresses are sign extended from bit 55)
 * the top 8 bits are ignored for purposes of TLB-invalidate-by-address
 * various operations that set the PC (branches, exception returns, etc)
   sign-extend the new PC value from bit 55
 * for a data abort or watchpoint hit, the virtual address reported in
   the FAR (fault address register) includes the tag bits

(Complication: for EL0/EL1 there are two 'enable tags' bits
in TCR_EL1, and which one you use depends on bit 55 of the VA,
so you can (say) enable tags for the "lower" half of the virtual
address space, and disable them for the "higher" half.)
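
To make the summary above concrete, here is a minimal sketch in C of the
"ignore the top 8 bits, sign-extend from bit 55" rule with a separate
enable per half of the address space. TagConfig and strip_tag() are
hypothetical names for illustration only; they are not QEMU code, and
the two bools merely stand in for the two enable bits in TCR_EL1.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two "enable tags" bits in TCR_EL1. */
typedef struct {
    bool tag_lower; /* tags enabled when VA bit 55 is 0 */
    bool tag_upper; /* tags enabled when VA bit 55 is 1 */
} TagConfig;

/*
 * Return the address that translation would use: when tagging is
 * enabled for the half selected by bit 55, bits [63:56] are replaced
 * by copies of bit 55 (sign extension from bit 55).
 */
static uint64_t strip_tag(uint64_t va, TagConfig cfg)
{
    bool upper = (va >> 55) & 1;
    bool enabled = upper ? cfg.tag_upper : cfg.tag_lower;
    uint64_t low56 = va & UINT64_C(0x00FFFFFFFFFFFFFF);

    if (!enabled) {
        return va; /* tag bits are ordinary address bits here */
    }
    return upper ? (low56 | UINT64_C(0xFF00000000000000)) : low56;
}
```

Note that the FAR would still need the original, unstripped value, so
a helper like this could only be applied on the translation side.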

I thought of two possible ways to approach implementing this.
Option (1) would be to change the codegen in translate-a64.c so that
we mask out high bits before doing the QEMU load/store TCG op.
Option (2) leaves the VA that we pass to the TCG load/store alone
(ie with tag bits intact) and tries to handle this all in the va-to-pa
code.

I think option (1) is a non-starter because of that requirement to
report the full address with tags in the FAR (as well as being slower
due to all the extra masking on memory operations). So that leaves
option (2), possibly with some help from common code to make things
a bit less awkward.

In particular I think if you just do the relevant handling of the tag
bits in target-arm's get_phys_addr() and its subroutines then this
should work ok, with the exceptions that:
 * the QEMU TLB code will think that [tag A + address X] and
   [tag B + address X] are different virtual addresses and they will
   miss each other in the TLB
 * tlb invalidate by address becomes nasty because we need to invalidate
   [every tag + address X]
Can we fix those just by having arm_tlb_fill() call
tlb_set_page_with_attrs() with the vaddr with the tag masked out?

Have I missed some complication that would make this not work?

[NB: this is all assuming softmmu; getting tagged addresses to work
in linux-user mode would require doing the masking in translate.c,
but I definitely don't want two implementations so I guess we just
ignore linux-user here.]

thanks
-- PMM

* Re: [Qemu-devel] best way to implement emulation of AArch64 tagged addresses
  2016-04-04 15:51 [Qemu-devel] best way to implement emulation of AArch64 tagged addresses Peter Maydell
@ 2016-04-04 16:28 ` Richard Henderson
  2016-04-04 16:31   ` Peter Maydell
  2016-04-04 16:35 ` Peter Maydell
  1 sibling, 1 reply; 12+ messages in thread
From: Richard Henderson @ 2016-04-04 16:28 UTC (permalink / raw)
  To: Peter Maydell, QEMU Developers; +Cc: Thomas Hanson

On 04/04/2016 08:51 AM, Peter Maydell wrote:
> I thought of two possible ways to approach implementing this.
> Option (1) would be to change the codegen in translate-a64.c so that
> we mask out high bits before doing the QEMU load/store TCG op.
> Option (2) leaves the VA that we pass to the TCG load/store alone
> (ie with tag bits intact) and tries to handle this all in the va-to-pa
> code.
>
> I think option (1) is a non-starter because of that requirement to
> report the full address with tags in the FAR (as well as being slower
> due to all the extra masking on memory operations). So that leaves
> option (2), possibly with some help from common code to make things
> a bit less awkward.

Agreed.

> In particular I think if you just do the relevant handling of the tag
> bits in target-arm's get_phys_addr() and its subroutines then this
> should work ok, with the exceptions that:
>   * the QEMU TLB code will think that [tag A + address X] and
>     [tag B + address X] are different virtual addresses and they will
>     miss each other in the TLB

Yep.  Not only miss, but actively contend with each other.

>   * tlb invalidate by address becomes nasty because we need to invalidate
>     [every tag + address X]

Hmm.  We should require only one flush for X.  But the common code doesn't know 
that...  I suppose a new tlb_flush_page_mask would do the trick.
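
A rough sketch of what a masked flush could look like, using a toy
direct-mapped TLB rather than QEMU's real data structures. Everything
named toy_* is hypothetical, as is the signature given to the flush
helper; the real TLB also has multiple MMU indexes and a victim cache,
which are left out. The point is only that the tag bits do not take
part in the index, so a single slot needs checking for X.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_BITS 12
#define TOY_PAGE_MASK (~((UINT64_C(1) << TOY_PAGE_BITS) - 1))
#define TOY_TLB_SIZE  8
#define TOY_INVALID   UINT64_MAX /* low bits set: never a page address */

typedef struct {
    uint64_t vaddr_page; /* page-aligned virtual address, tag included */
} ToyTLBEntry;

/* The index uses only low page-number bits, so tags never affect it. */
static size_t toy_tlb_index(uint64_t addr)
{
    return (size_t)((addr >> TOY_PAGE_BITS) & (TOY_TLB_SIZE - 1));
}

static void toy_tlb_init(ToyTLBEntry *tlb)
{
    for (size_t i = 0; i < TOY_TLB_SIZE; i++) {
        tlb[i].vaddr_page = TOY_INVALID;
    }
}

static void toy_tlb_fill(ToyTLBEntry *tlb, uint64_t addr)
{
    tlb[toy_tlb_index(addr)].vaddr_page = addr & TOY_PAGE_MASK;
}

/*
 * Invalidate the entry for addr if it matches in the bits selected
 * by mask. Returns true if an entry was flushed.
 */
static bool toy_tlb_flush_page_mask(ToyTLBEntry *tlb, uint64_t addr,
                                    uint64_t mask)
{
    ToyTLBEntry *e = &tlb[toy_tlb_index(addr)];
    uint64_t cmp_mask = mask & TOY_PAGE_MASK;

    if (e->vaddr_page == TOY_INVALID) {
        return false;
    }
    if (((e->vaddr_page ^ addr) & cmp_mask) != 0) {
        return false;
    }
    e->vaddr_page = TOY_INVALID;
    return true;
}
```

With a mask of 0x00FFFFFFFFFFFFFF, a flush for X removes the entry
whatever tag it was filled with; with an all-ones mask the same call
behaves like an ordinary flush by exact address.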

> Can we fix those just by having arm_tlb_fill() call
> tlb_set_page_with_attrs() with the vaddr with the tag masked out?

No, that misses when we perform the full vaddr+tag comparison on the TCG fast path.
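
To illustrate with a toy model (not QEMU code; all toy_* names are
made up for this sketch): the fast path compares the page part of the
full guest address, tag included, against the stored comparator. If
the fill stores a tag-stripped address, a tagged access never matches
it. If the fill stores the tagged address, two tags for the same page
evict each other from the same slot.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PAGE_BITS 12
#define TOY_PAGE_MASK (~((UINT64_C(1) << TOY_PAGE_BITS) - 1))
#define TOY_TLB_SIZE  8
#define TOY_INVALID   UINT64_MAX /* low bits set: never a page address */

typedef struct {
    uint64_t vaddr_page; /* comparator checked by the fast path */
} ToyTLBEntry;

static size_t toy_tlb_index(uint64_t addr)
{
    return (size_t)((addr >> TOY_PAGE_BITS) & (TOY_TLB_SIZE - 1));
}

static void toy_tlb_init(ToyTLBEntry *tlb)
{
    for (size_t i = 0; i < TOY_TLB_SIZE; i++) {
        tlb[i].vaddr_page = TOY_INVALID;
    }
}

/* Slow-path fill: store whatever vaddr the caller hands over. */
static void toy_tlb_set_page(ToyTLBEntry *tlb, uint64_t vaddr)
{
    tlb[toy_tlb_index(vaddr)].vaddr_page = vaddr & TOY_PAGE_MASK;
}

/* Fast-path check: full address (tag included) against the entry. */
static bool toy_fast_path_hit(const ToyTLBEntry *tlb, uint64_t addr)
{
    return tlb[toy_tlb_index(addr)].vaddr_page == (addr & TOY_PAGE_MASK);
}
```

In this model, filling with the masked vaddr means every tagged access
takes the slow path again, which is the miss described above.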

> [NB: this is all assuming softmmu; getting tagged addresses to work
> in linux-user mode would require doing the masking in translate.c,
> but I definitely don't want two implementations so I guess we just
> ignore linux-user here.]

Let's just say it's another user for the oft wished for softmmu-in-linux-user.


r~

* Re: [Qemu-devel] best way to implement emulation of AArch64 tagged addresses
  2016-04-04 16:28 ` Richard Henderson
@ 2016-04-04 16:31   ` Peter Maydell
  2016-04-04 17:56     ` Richard Henderson
  0 siblings, 1 reply; 12+ messages in thread
From: Peter Maydell @ 2016-04-04 16:31 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Thomas Hanson, QEMU Developers

On 4 April 2016 at 17:28, Richard Henderson <rth@twiddle.net> wrote:
> On 04/04/2016 08:51 AM, Peter Maydell wrote:
>> In particular I think if you just do the relevant handling of the tag
>> bits in target-arm's get_phys_addr() and its subroutines then this
>> should work ok, with the exceptions that:
>>   * the QEMU TLB code will think that [tag A + address X] and
>>     [tag B + address X] are different virtual addresses and they will
>>     miss each other in the TLB
>
>
> Yep.  Not only miss, but actively contend with each other.

Yes. Can we avoid that, or do we just have to live with it? I guess
if the TCG fast path is doing a compare on full vaddr+tag then we
pretty much have to live with it.

>>   * tlb invalidate by address becomes nasty because we need to invalidate
>>     [every tag + address X]
>
> Hmm.  We should require only one flush for X.  But the common code doesn't
> know that...  I suppose a new tlb_flush_page_mask would do the trick.

Yes, I think we would need that.

>> Can we fix those just by having arm_tlb_fill() call
>> tlb_set_page_with_attrs() with the vaddr with the tag masked out?
>
> No, that misses when we perform the full vaddr+tag comparison on the TCG
> fast path.

Rats, you're right.

thanks
-- PMM

* Re: [Qemu-devel] best way to implement emulation of AArch64 tagged addresses
  2016-04-04 15:51 [Qemu-devel] best way to implement emulation of AArch64 tagged addresses Peter Maydell
  2016-04-04 16:28 ` Richard Henderson
@ 2016-04-04 16:35 ` Peter Maydell
  1 sibling, 0 replies; 12+ messages in thread
From: Peter Maydell @ 2016-04-04 16:35 UTC (permalink / raw)
  To: QEMU Developers; +Cc: Thomas Hanson, Richard Henderson

On 4 April 2016 at 16:51, Peter Maydell <peter.maydell@linaro.org> wrote:
> In particular I think if you just do the relevant handling of the tag
> bits in target-arm's get_phys_addr() and its subroutines then this
> should work ok

Forgot to mention, but the handling of "do stuff on various stores
to PC" needs to be done in translate-a64.c for branch insns and
in the exception entry/exit functions in helper.c/op_helper.c
for exceptions. That part should be straightforward, though.

thanks
-- PMM

* Re: [Qemu-devel] best way to implement emulation of AArch64 tagged addresses
  2016-04-04 16:31   ` Peter Maydell
@ 2016-04-04 17:56     ` Richard Henderson
  2016-04-08 17:20       ` Tom Hanson
  0 siblings, 1 reply; 12+ messages in thread
From: Richard Henderson @ 2016-04-04 17:56 UTC (permalink / raw)
  To: Peter Maydell; +Cc: Thomas Hanson, QEMU Developers

On 04/04/2016 09:31 AM, Peter Maydell wrote:
> On 4 April 2016 at 17:28, Richard Henderson <rth@twiddle.net> wrote:
>> On 04/04/2016 08:51 AM, Peter Maydell wrote:
>>> In particular I think if you just do the relevant handling of the tag
>>> bits in target-arm's get_phys_addr() and its subroutines then this
>>> should work ok, with the exceptions that:
>>>    * the QEMU TLB code will think that [tag A + address X] and
>>>      [tag B + address X] are different virtual addresses and they will
>>>      miss each other in the TLB
>>
>>
>> Yep.  Not only miss, but actively contend with each other.
>
> Yes. Can we avoid that, or do we just have to live with it? I guess
> if the TCG fast path is doing a compare on full vaddr+tag then we
> pretty much have to live with it.

We have to live with it.  Implementing a more complex hashing algorithm in the 
fast path is probably a non-starter.

Hopefully if one is using multiple tags, they'll still be in the victim cache 
and so you won't have to fall back to the full tlb lookup.


r~

* Re: [Qemu-devel] best way to implement emulation of AArch64 tagged addresses
  2016-04-04 17:56     ` Richard Henderson
@ 2016-04-08 17:20       ` Tom Hanson
  2016-04-08 18:06         ` Peter Maydell
  2016-04-08 18:10         ` Richard Henderson
  0 siblings, 2 replies; 12+ messages in thread
From: Tom Hanson @ 2016-04-08 17:20 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Peter Maydell, QEMU Developers

On Mon, 2016-04-04 at 10:56 -0700, Richard Henderson wrote:
> On 04/04/2016 09:31 AM, Peter Maydell wrote:
> > On 4 April 2016 at 17:28, Richard Henderson <rth@twiddle.net> wrote:
> >> On 04/04/2016 08:51 AM, Peter Maydell wrote:
> >>> In particular I think if you just do the relevant handling of the tag
> >>> bits in target-arm's get_phys_addr() and its subroutines then this
> >>> should work ok, with the exceptions that:
> >>>    * the QEMU TLB code will think that [tag A + address X] and
> >>>      [tag B + address X] are different virtual addresses and they will
> >>>      miss each other in the TLB
> >>
> >>
> >> Yep.  Not only miss, but actively contend with each other.
> >
> > Yes. Can we avoid that, or do we just have to live with it? I guess
> > if the TCG fast path is doing a compare on full vaddr+tag then we
> > pretty much have to live with it.
> 
> We have to live with it.  Implementing a more complex hashing algorithm in the 
> fast path is probably a non-starter.
> 
> Hopefully if one is using multiple tags, they'll still be in the victim cache 
> and so you won't have to fall back to the full tlb lookup.
> 
> 
> r~

It seems like the "best" solution would be to mask the tag in the TLB
and it feels like it should be possible.  BUT I need to dig into the
code more.

Is it an option to mask off the tag bits in all cases? Is there any case
in which those bits are valid address bits?

-TWH

* Re: [Qemu-devel] best way to implement emulation of AArch64 tagged addresses
  2016-04-08 17:20       ` Tom Hanson
@ 2016-04-08 18:06         ` Peter Maydell
  2016-04-08 18:10         ` Richard Henderson
  1 sibling, 0 replies; 12+ messages in thread
From: Peter Maydell @ 2016-04-08 18:06 UTC (permalink / raw)
  To: Tom Hanson; +Cc: Richard Henderson, QEMU Developers

On 8 April 2016 at 18:20, Tom Hanson <thomas.hanson@linaro.org> wrote:
> On Mon, 2016-04-04 at 10:56 -0700, Richard Henderson wrote:
>> On 04/04/2016 09:31 AM, Peter Maydell wrote:
>> > On 4 April 2016 at 17:28, Richard Henderson <rth@twiddle.net> wrote:
>> >> On 04/04/2016 08:51 AM, Peter Maydell wrote:
>> >>> In particular I think if you just do the relevant handling of the tag
>> >>> bits in target-arm's get_phys_addr() and its subroutines then this
>> >>> should work ok, with the exceptions that:
>> >>>    * the QEMU TLB code will think that [tag A + address X] and
>> >>>      [tag B + address X] are different virtual addresses and they will
>> >>>      miss each other in the TLB
>> >>
>> >>
>> >> Yep.  Not only miss, but actively contend with each other.
>> >
>> > Yes. Can we avoid that, or do we just have to live with it? I guess
>> > if the TCG fast path is doing a compare on full vaddr+tag then we
>> > pretty much have to live with it.
>>
>> We have to live with it.  Implementing a more complex hashing algorithm in the
>> fast path is probably a non-starter.
>>
>> Hopefully if one is using multiple tags, they'll still be in the victim cache
>> and so you won't have to fall back to the full tlb lookup.

> It seems like the "best" solution would be to mask the tag in the TLB
> and it feels like it should be possible.  BUT I need to dig into the
> code more.
>
> Is it an option to mask off the tag bits in all cases? Is there any case
> in which those bits are valid address bits?

The problem, as Richard says, is that our fast path for guest
loads/stores is a bit of inline assembly that basically fishes
the right entry out of the TLB and compares it against the
input address (ie whatever the guest address to the load is
including the tag). A comparison match means we take the fast
path and do an inline access to the backing guest RAM. A mismatch
means we take the slow path (for TLB misses, IO devices, and
various other cases). Since the guest address that the fast
path sees includes the tag bits, if the TLB entry doesn't
include the tag bits then we'd need to do an extra mask operation
in the fast path, which is (a) not good for performance and
(b) would require modifying nine different TCG backends.

For a rarely used feature this is much too much effort (and
it slows down all the code that doesn't use tags for an
uncertain benefit to the code that does use them).

(If you're curious about the inline assembly, it's generated
by functions like tcg_out_tlb_load() in
tcg/i386/tcg-target.inc.c for the x86 backend; similarly for
the various other backends.)

thanks
-- PMM

* Re: [Qemu-devel] best way to implement emulation of AArch64 tagged addresses
  2016-04-08 17:20       ` Tom Hanson
  2016-04-08 18:06         ` Peter Maydell
@ 2016-04-08 18:10         ` Richard Henderson
  2016-04-09  0:29           ` Thomas Hanson
  1 sibling, 1 reply; 12+ messages in thread
From: Richard Henderson @ 2016-04-08 18:10 UTC (permalink / raw)
  To: Tom Hanson; +Cc: Peter Maydell, QEMU Developers

On 04/08/2016 10:20 AM, Tom Hanson wrote:
> Is it an option to mask off the tag bits in all cases? Is there any case
> in which those bits are valid address bits?

It's not impossible to mask off bits in the address -- we do that for running
32-bit on 64-bit all of the time.  It's all a question of how well the average
program will perform, I suppose.

For instance: are there more tagged addresses than non-tagged addresses?  If
we mask off bits, that will affect *every* memory operation.  If tagged
addresses are rare, then that is a waste.  If tagged addresses are common,
however, then we may well spend too much time ping-ponging in the TLB.

The fastest method I can think of to ignore high order bits is to shift the
address comparator left.  The TLB comparator would be stored pre-shifted, so
this would add only one insn on the fast path.  Or perhaps zero in the case of
an arm/aarch64 host, where the compare insn itself can perform the shift.

Of course, a double-word shift would be completely out of the question when
doing 64-bit on 32-bit emulation.  But we don't need that -- just shift the
high part of the address left to discard bits, leaving a funny looking hole in
the middle of the comparator.

This is simple enough that it should be relatively easy to patch up all of the
tcg backends to match, if we decide to go with it.
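
As a sketch of the idea in plain C (the real change would be in the
code each TCG backend emits; the toy_* names and the fixed 8-bit shift
are assumptions for illustration): the comparator is stored with the
tag already shifted out, and the fast path shifts the incoming address
by the same amount before comparing. The split variant models a 64-bit
guest on a 32-bit host, shifting only the high word.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGE_BITS 12
#define TOY_PAGE_MASK (~((UINT64_C(1) << TOY_PAGE_BITS) - 1))
#define TOY_TAG_BITS  8

/* Comparator as stored in the TLB entry: tag shifted out in advance. */
static uint64_t toy_make_comparator(uint64_t vaddr)
{
    return (vaddr & TOY_PAGE_MASK) << TOY_TAG_BITS;
}

/* Fast-path check: one extra shift on the address, then compare. */
static bool toy_match(uint64_t comparator, uint64_t addr)
{
    return ((addr & TOY_PAGE_MASK) << TOY_TAG_BITS) == comparator;
}

/*
 * 64-bit guest on a 32-bit host: shift only the high word. The low
 * TOY_TAG_BITS bits of hi end up zero, which is the "hole" in the
 * middle of the comparator.
 */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} ToySplitComparator;

static ToySplitComparator toy_make_split(uint64_t vaddr)
{
    ToySplitComparator c;

    c.lo = (uint32_t)(vaddr & TOY_PAGE_MASK);
    c.hi = (uint32_t)((uint32_t)(vaddr >> 32) << TOY_TAG_BITS);
    return c;
}

static bool toy_match_split(ToySplitComparator c, uint64_t addr)
{
    ToySplitComparator a = toy_make_split(addr);

    return a.lo == c.lo && a.hi == c.hi;
}
```

This version ignores the top byte unconditionally; honouring the
per-half enable bits would need the shift count to be a runtime or
translation-time value rather than a constant.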


r~

* Re: [Qemu-devel] best way to implement emulation of AArch64 tagged addresses
  2016-04-08 18:10         ` Richard Henderson
@ 2016-04-09  0:29           ` Thomas Hanson
  2016-04-09 15:57             ` Richard Henderson
  0 siblings, 1 reply; 12+ messages in thread
From: Thomas Hanson @ 2016-04-09  0:29 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Peter Maydell, QEMU Developers

Looking at tcg_out_tlb_load():
If I'm reading the pseudo-assembler of the function names correctly, it
looks like in the i386 code we're already masking the address being
checked:
    tgen_arithi(s, ARITH_AND + trexw, r1,
                TARGET_PAGE_MASK | (aligned ? s_mask : 0), 0);
where  TARGET_PAGE_MASK is a simple all-1's mask in the appropriate upper
bits.

Can we just poke some 0's into that mask in the tag locations?  And, of
course, do the same when creating the TLB entry.

Unless of course we're in the case of (TARGET_LONG_BITS >
TCG_TARGET_REG_BITS) (that would be 64 bit on 32 bit right?) when addrhi
gets tested separately. Then we'd have to do the shift as above.

MIPS logic appears similar on a quick read.  In the sparc code I'm not
seeing a pre-existing mask  but it's getting late and my eyes are giving
out.  Those are the only tcg_out_tlb_load() versions I can find.


As to frequency I'm assuming that there are far fewer tagged pointers than
untagged.  But then again I haven't seen a good use case for tagged
pointers.  Would love to hear one.

On 8 April 2016 at 12:10, Richard Henderson <rth@twiddle.net> wrote:

> On 04/08/2016 10:20 AM, Tom Hanson wrote:
> > Is it an option to mask off the tag bits in all cases? Is there any case
> > in which those bits are valid address bits?
>
> It's not impossible to mask off bits in the address -- we do that for
> running
> 32-bit on 64-bit all of the time.  It's all a question of how well the
> average
> program will perform, I suppose.
>
> For instance.  Are there more tagged addresses than non-tagged addresses?
> If
> we mask off bits, that will affect *every* memory operation.  If tagged
> addresses are rare, then that is a waste.  If tagged addresses are common,
> however, then we may well spend too much time ping-ponging in the TLB.
>
> The fastest method I can think of to ignore high order bits is to shift the
> address comparator left.  The TLB comparator would be stored pre-shifted,
> so
> this would add only one insn on the fast path.  Or perhaps zero in the
> case of
> an arm/aarch64 host, where the compare insn itself can perform the shift.
>
> Of course, a double-word shift would be completely out of the question when
> doing 64-bit on 32-bit emulation.  But we don't need that -- just shift the
> high part of the address left to discard bits, leaving a funny looking
> hole in
> the middle of the comparator.
>
> This is simple enough that it should be relatively easy to patch up all of
> the
> tcg backends to match, if we decide to go with it.
>
>
> r~
>
>

* Re: [Qemu-devel] best way to implement emulation of AArch64 tagged addresses
  2016-04-09  0:29           ` Thomas Hanson
@ 2016-04-09 15:57             ` Richard Henderson
  2016-04-11 12:58               ` Thomas Hanson
  0 siblings, 1 reply; 12+ messages in thread
From: Richard Henderson @ 2016-04-09 15:57 UTC (permalink / raw)
  To: Thomas Hanson; +Cc: Peter Maydell, QEMU Developers

On 04/08/2016 05:29 PM, Thomas Hanson wrote:
> Looking at tcg_out_tlb_load():
> If I'm reading the pseudo-assembler of the function names correctly, it looks
> like in the i386 code we're already masking the address being checked:
>      tgen_arithi(s, ARITH_AND + trexw, r1,
>                  TARGET_PAGE_MASK | (aligned ? s_mask : 0), 0);
> where  TARGET_PAGE_MASK is a simple all-1's mask in the appropriate upper bits.
>
> Can we just poke some 0's into that mask in the tag locations?

No, because we'd no longer have a sign-extended 32-bit value, as fits in that 
immediate operand field.  To load the constant you're asking for, we'd need a 
64-bit move insn and another register.
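
A small check of that constraint, as a sketch: fits_simm32() is a
hypothetical helper, and the 12-bit page size is an assumption. A
64-bit constant can be encoded in a sign-extended 32-bit immediate
only if bits [63:31] are all zeros or all ones.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True if v can be produced by sign-extending a 32-bit immediate,
 * i.e. bits [63:31] (33 bits) are all zeros or all ones.
 */
static bool fits_simm32(uint64_t v)
{
    uint64_t top = v >> 31;

    return top == 0 || top == UINT64_C(0x1FFFFFFFF);
}

/* Page mask for an assumed 12-bit page size: all ones above bit 11. */
static const uint64_t toy_page_mask = UINT64_C(0xFFFFFFFFFFFFF000);

/* The same mask with zeros poked into the 8 tag bits. */
static const uint64_t toy_page_mask_no_tag = UINT64_C(0x00FFFFFFFFFFF000);
```

Here fits_simm32(toy_page_mask) holds, while
fits_simm32(toy_page_mask_no_tag) does not, so the second constant
would have to be loaded into a register first.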


r~

* Re: [Qemu-devel] best way to implement emulation of AArch64 tagged addresses
  2016-04-09 15:57             ` Richard Henderson
@ 2016-04-11 12:58               ` Thomas Hanson
  2016-04-13 13:36                 ` Tom Hanson
  0 siblings, 1 reply; 12+ messages in thread
From: Thomas Hanson @ 2016-04-11 12:58 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Peter Maydell, QEMU Developers

Ah, true.

On 9 April 2016 at 09:57, Richard Henderson <rth@twiddle.net> wrote:

> On 04/08/2016 05:29 PM, Thomas Hanson wrote:
>
>> Looking at tcg_out_tlb_load():
>> If I'm reading the pseudo-assembler of the function names correctly, it
>> looks
>> like in the i386 code we're already masking the address being checked:
>>      tgen_arithi(s, ARITH_AND + trexw, r1, TARGET_PAGE_MASK | (aligned ?
>> s_mask
>> : 0), 0);
>> where  TARGET_PAGE_MASK is a simple all-1's mask in the appropriate upper
>> bits.
>>
>> Can we just poke some 0's into that mask in the tag locations?
>>
>
> No, because we'd no longer have a sign-extended 32-bit value, as fits in
> that immediate operand field.  To load the constant you're asking for, we'd
> need a 64-bit move insn and another register.
>
>
> r~
>

* Re: [Qemu-devel] best way to implement emulation of AArch64 tagged addresses
  2016-04-11 12:58               ` Thomas Hanson
@ 2016-04-13 13:36                 ` Tom Hanson
  0 siblings, 0 replies; 12+ messages in thread
From: Tom Hanson @ 2016-04-13 13:36 UTC (permalink / raw)
  To: Richard Henderson; +Cc: Peter Maydell, QEMU Developers

On 04/11/2016 06:58 AM, Thomas Hanson wrote:
> Ah, true.
>
> On 9 April 2016 at 09:57, Richard Henderson <rth@twiddle.net
> <mailto:rth@twiddle.net>> wrote:
>
>     On 04/08/2016 05:29 PM, Thomas Hanson wrote:
>
>         Looking at tcg_out_tlb_load():
>         If I'm reading the pseudo-assembler of the function names
>         correctly, it looks
>         like in the i386 code we're already masking the address being
>         checked:
>               tgen_arithi(s, ARITH_AND + trexw, r1, TARGET_PAGE_MASK |
>         (aligned ? s_mask
>         : 0), 0);
>         where  TARGET_PAGE_MASK is a simple all-1's mask in the
>         appropriate upper bits.
>
>         Can we just poke some 0's into that mask in the tag locations?
>
>
>     No, because we'd no longer have a sign-extended 32-bit value, as
>     fits in that immediate operand field.  To load the constant you're
>     asking for, we'd need a 64-bit move insn and another register.
>
>
>     r~
>
>
[Sorry for the previous top post(s).  I've switched email clients...]

So, is the consensus that it's not worth adding an instruction to the 
fast path to avoid kicking out TLB entries with non-matching tags?

Or is this still under consideration?
