Date: Mon, 3 Jun 2013 13:21:55 +0200
From: Jani Kokkonen
To: Richard Henderson
Cc: Laurent Desnogues, Peter Maydell, Claudio Fontana, qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [PATCH 4/4] tcg/aarch64: implement tlb lookup fast path
Message-ID: <51AC7C53.40105@huawei.com>
In-Reply-To: <51A90736.4090005@twiddle.net>

On 5/31/2013 10:25 PM, Richard Henderson wrote:
> On 05/31/2013 11:07 AM, Jani Kokkonen wrote:
>> +/* Load and compare a TLB entry, leaving the flags set. Leaves X2 pointing
>> +   to the tlb entry. Clobbers X0,X1,X2,X3 and TMP. */
>> +
>> +static void tcg_out_tlb_read(TCGContext *s, TCGReg addr_reg,
>> +                             int s_bits, uint8_t **label_ptr, int tlb_offset)
>> +{
>
> You copied the comment from ARM, and it isn't correct. You generate branches.

I will fix the comment.

>> +    TCGReg base = TCG_AREG0;
>> +
>> +    tcg_out_shr(s, 1, TCG_REG_TMP, addr_reg, TARGET_PAGE_BITS);
>> +    tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_X1, tlb_offset);
>> +    tcg_out_arith(s, ARITH_ADD, 1, TCG_REG_X2, base, TCG_REG_X1, 0);
>> +    tcg_out_andi(s, 1, TCG_REG_X0, TCG_REG_TMP, CPU_TLB_BITS, 0);
>> +    tcg_out_arith(s, ARITH_ADD, 1, TCG_REG_X2, TCG_REG_X2,
>> +                  TCG_REG_X0, -CPU_TLB_ENTRY_BITS);
>> +#if TARGET_LONG_BITS == 64
>> +    tcg_out_ldst(s, LDST_64, LDST_LD, TCG_REG_X3, TCG_REG_X2, 0);
>> +#else
>> +    tcg_out_ldst(s, LDST_32, LDST_LD, TCG_REG_X3, TCG_REG_X2, 0);
>> +#endif
>> +    /* check alignment */
>> +    if (s_bits) {
>> +        tcg_out_tst(s, 1, addr_reg, s_bits, 0);
>> +        label_ptr[0] = s->code_ptr;
>> +        tcg_out_goto_cond_noaddr(s, TCG_COND_NE);
>> +    }
>> +    tcg_out_cmp(s, 1, TCG_REG_X3, TCG_REG_TMP, -TARGET_PAGE_BITS);
>> +    label_ptr[1] = s->code_ptr;
>> +    tcg_out_goto_cond_noaddr(s, TCG_COND_NE);
>
> I'm positive that the branch predictor would be happier with a single branch
> rather than the two you generate here. It ought to be possible to use a
> different set of insns to do this in one go.
>
> How about something like
>
> @ extract the tlb index from the address
> ubfm w0, addr_reg, TARGET_PAGE_BITS, CPU_TLB_BITS
>
> @ add any "high bits" from the tlb offset
> @ noting that env will be much smaller than 24 bits.
> add x1, env, tlb_offset & 0xfff000
>
> @ zap the tlb index from the address for compare
> @ this is all high bits plus 0-3 low bits set, so this
> @ should match a logical immediate.
> and w/x2, addr_reg, TARGET_PAGE_MASK | ((1 << s_bits) - 1)
>
> @ merge the tlb index into the env+tlb_offset
> add x1, x1, x0, lsl #3
>
> @ load the tlb comparator. the 12-bit scaled offset
> @ form will fit the bits remaining from above, given that
> @ we're loading an aligned object, and so the low 2/3 bits
> @ will be clear.
> ldr w/x0, [x1, tlb_offset & 0xfff]
>
> @ load the tlb addend. do this early to avoid stalling.
> @ the addend_offset differs from tlb_offset by 1-3 words.
> @ given that we've got overlap between the scaled 12-bit
> @ value and the 12-bit shifted value above, this also ought
> @ to always be representable.
> ldr x3, [x1, (tlb_offset & 0xfff) + (addend_offset - tlb_offset)]
>
> @ perform the comparison
> cmp w/x0, w/x2
>
> @ generate the complete host address in parallel with the cmp.
> add x3, x3, addr_reg        @ 64-bit guest
> add x3, x3, addr_reg, uxtw  @ 32-bit guest
>
> bne miss_label
>
> Note that the w/x above indicates the ext setting that ought to be used,
> depending on the address size of the guest.
>
> This is at least 2 insns shorter than your sequence.

Ok, thanks. I will add the ubfm instruction and modify the implementation
based on your comments.
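To check that I read the sequence correctly, here is a rough C model of what
the single-compare fast path computes; the constants and the entry type are
illustrative stand-ins, not the real QEMU definitions:

#include <stdint.h>

/* Illustrative stand-ins for the real QEMU macros. */
#define TARGET_PAGE_BITS 12
#define TARGET_PAGE_MASK (~(((uint64_t)1 << TARGET_PAGE_BITS) - 1))
#define CPU_TLB_BITS     8
#define CPU_TLB_SIZE     (1 << CPU_TLB_BITS)

typedef struct {
    uint64_t  addr;    /* page-aligned guest address cached in this entry */
    uintptr_t addend;  /* host address minus guest address for that page */
} TLBEntrySketch;

/* Returns the host address on a TLB hit, 0 on a miss. */
static uintptr_t tlb_fast_path(const TLBEntrySketch *tlb, uint64_t addr,
                               int s_bits)
{
    /* ubfm: extract the tlb index from the address */
    unsigned idx = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
    /* and: keep the page bits plus the low alignment bits, so an
       unaligned access can never equal the page-aligned comparator */
    uint64_t cmp = addr & (TARGET_PAGE_MASK | ((1u << s_bits) - 1));
    /* cmp + bne: one compare covers both the page and alignment checks */
    if (tlb[idx].addr != cmp) {
        return 0;  /* miss: branch off to the slow path */
    }
    /* add: the per-page addend turns the guest address into a host one */
    return tlb[idx].addend + (uintptr_t)addr;
}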
> Have you looked at doing the out-of-line tlb miss sequence right from the
> very beginning? It's not that much more difficult to accomplish than the
> inline tlb miss.

I have to look into this one.

> See CONFIG_QEMU_LDST_OPTIMIZATION, and the implementation in tcg/arm.
> You won't need two nops after the call; aarch64 can do all the required
> extensions and data movement operations in a single insn.

I will take this also into account.

> r~

-Jani
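P.S. From a first look at tcg/arm, the out-of-line scheme seems to boil down
to recording, for each qemu_ld/st, enough state to emit the helper call after
the rest of the block; a rough sketch with made-up names, not the actual QEMU
structures:

#include <stdint.h>

/* One record per qemu_ld/st whose TLB-miss branch is left unpatched
   at translation time. */
typedef struct LdstLabelSketch {
    int is_ld;                     /* load or store */
    int opc;                       /* size and signedness of the access */
    int data_reg, addr_reg;        /* TCG registers used by the access */
    int mem_index;                 /* mmu index passed to the helper */
    uint8_t *label_ptr;            /* the bne to patch toward the slow path */
    uint8_t *raddr;                /* fast-path address to return to */
    struct LdstLabelSketch *next;  /* walked when the block is finalized */
} LdstLabelSketch;

If I read it right, the backend walks this list at the end of the block,
emits the helper call for each record, patches the saved branch to point at
it, and jumps back to raddr, so the fast path stays straight-line code on a
hit.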