Re: [Qemu-devel] [PATCH 4/4] tcg/aarch64: implement tlb lookup fast path

From: Jani Kokkonen <Jani.Kokkonen@huawei.com>
To: Richard Henderson <rth@twiddle.net>
Cc: Laurent Desnogues <laurent.desnogues@gmail.com>,
	Peter Maydell <peter.maydell@linaro.org>,
	Claudio Fontana <claudio.fontana@huawei.com>,
	qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [PATCH 4/4] tcg/aarch64: implement tlb lookup fast path
Date: Mon, 3 Jun 2013 13:21:55 +0200	[thread overview]
Message-ID: <51AC7C53.40105@huawei.com> (raw)
In-Reply-To: <51A90736.4090005@twiddle.net>

On 5/31/2013 10:25 PM, Richard Henderson wrote:
> On 05/31/2013 11:07 AM, Jani Kokkonen wrote:
>> +/* Load and compare a TLB entry, leaving the flags set.  Leaves X2 pointing
>> +   to the tlb entry.  Clobbers X0,X1,X2,X3 and TMP.  */
>> +
>> +static void tcg_out_tlb_read(TCGContext *s, TCGReg addr_reg,
>> +                             int s_bits, uint8_t **label_ptr, int tlb_offset)
>> +{
> 
> You copied the comment from ARM, and it isn't correct.  You generate branches.

I will fix the comment.
> 
>> +    TCGReg base = TCG_AREG0;
>> +
>> +    tcg_out_shr(s, 1, TCG_REG_TMP, addr_reg, TARGET_PAGE_BITS);
>> +    tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_X1, tlb_offset);
>> +    tcg_out_arith(s, ARITH_ADD, 1, TCG_REG_X2, base, TCG_REG_X1, 0);
>> +    tcg_out_andi(s, 1, TCG_REG_X0, TCG_REG_TMP, CPU_TLB_BITS, 0);
>> +    tcg_out_arith(s, ARITH_ADD, 1, TCG_REG_X2, TCG_REG_X2,
>> +                             TCG_REG_X0, -CPU_TLB_ENTRY_BITS);
>> +#if TARGET_LONG_BITS == 64
>> +    tcg_out_ldst(s, LDST_64, LDST_LD, TCG_REG_X3, TCG_REG_X2, 0);
>> +#else
>> +    tcg_out_ldst(s, LDST_32, LDST_LD, TCG_REG_X3, TCG_REG_X2, 0);
>> +#endif
>> +    /* check alignment */
>> +    if (s_bits) {
>> +        tcg_out_tst(s, 1, addr_reg, s_bits, 0);
>> +        label_ptr[0] = s->code_ptr;
>> +        tcg_out_goto_cond_noaddr(s, TCG_COND_NE);
>> +    }
>> +    tcg_out_cmp(s, 1, TCG_REG_X3, TCG_REG_TMP, -TARGET_PAGE_BITS);
>> +    label_ptr[1] = s->code_ptr;
>> +    tcg_out_goto_cond_noaddr(s, TCG_COND_NE);
> 
> I'm positive that the branch predictor would be happier with a single branch
> rather than the two you generate here.  It ought to be possible to use a
> different set of insns to do this in one go.
> 
> How about something like
> 
> 	@ extract the tlb index from the address
> 	ubfm	w0, addr_reg, TARGET_PAGE_BITS, CPU_TLB_BITS
> 
> 	@ add any "high bits" from the tlb offset
> 	@ noting that env will be much smaller than 24 bits.
> 	add	x1, env, tlb_offset & 0xfff000
> 
> 	@ zap the tlb index from the address for compare
> 	@ this is all high bits plus 0-3 low bits set, so this
> 	@ should match a logical immediate.
> 	and	w/x2, addr_reg, TARGET_PAGE_MASK | ((1 << s_bits) - 1)
> 
> 	@ merge the tlb index into the env+tlb_offset
> 	add	x1, x1, x0, lsl #3
> 
> 	@ load the tlb comparator.  the 12-bit scaled offset
> 	@ form will fit the bits remaining from above, given that
> 	@ we're loading an aligned object, and so the low 2/3 bits
> 	@ will be clear.
> 	ldr	w/x0, [x1, tlb_offset & 0xfff]
> 
> 	@ load the tlb addend.  do this early to avoid stalling.
> 	@ the addend_offset differs from tlb_offset by 1-3 words.
> 	@ given that we've got overlap between the scaled 12-bit
> 	@ value and the 12-bit shifted value above, this also ought
> 	@ to always be representable.
> 	ldr	x3, [x1, (tlb_offset & 0xfff) + (addend_offset - tlb_offset)]
> 
> 	@ perform the comparison
> 	cmp	w/x0, w/x2
> 
> 	@ generate the complete host address in parallel with the cmp.
> 	add	x3, x3, addr_reg		@ 64-bit guest
> 	add	x3, x3, addr_reg, uxtw		@ 32-bit guest
> 
> 	bne	miss_label
> 
> Note that the w/x above indicates the ext setting that ought to be used,
> depending on the address size of the guest.
> 
> This is at least 2 insns shorter than your sequence.

Ok, thanks. ubfm instruction will be added and I will modify implementation based on your comments.
> 
> Have you looked at doing the out-of-line tlb miss sequence right from the
> very beginning?  It's not that much more difficult to accomplish than the
> inline tlb miss.

I have to look into this one.
> 
> See CONFIG_QEMU_LDST_OPTIMIZATION, and the implementation in tcg/arm.
> You won't need two nops after the call; aarch64 can do all the required
> extensions and data movement operations in a single insn.
> 
> 

I will take this also into account. 

> r~
> 

-Jani