From: Richard Henderson
Date: Fri, 31 May 2013 13:25:26 -0700
Message-ID: <51A90736.4090005@twiddle.net>
In-Reply-To: <51A8E6E2.1010704@huawei.com>
Subject: Re: [Qemu-devel] [PATCH 4/4] tcg/aarch64: implement tlb lookup fast path
To: Jani Kokkonen
Cc: Laurent Desnogues, Peter Maydell, Claudio Fontana, qemu-devel@nongnu.org

On 05/31/2013 11:07 AM, Jani Kokkonen wrote:
> +/* Load and compare a TLB entry, leaving the flags set.  Leaves X2 pointing
> +   to the tlb entry.  Clobbers X0,X1,X2,X3 and TMP.  */
> +
> +static void tcg_out_tlb_read(TCGContext *s, TCGReg addr_reg,
> +                             int s_bits, uint8_t **label_ptr, int tlb_offset)
> +{

You copied the comment from ARM, and it isn't correct.  You generate
branches.

> +    TCGReg base = TCG_AREG0;
> +
> +    tcg_out_shr(s, 1, TCG_REG_TMP, addr_reg, TARGET_PAGE_BITS);
> +    tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_X1, tlb_offset);
> +    tcg_out_arith(s, ARITH_ADD, 1, TCG_REG_X2, base, TCG_REG_X1, 0);
> +    tcg_out_andi(s, 1, TCG_REG_X0, TCG_REG_TMP, CPU_TLB_BITS, 0);
> +    tcg_out_arith(s, ARITH_ADD, 1, TCG_REG_X2, TCG_REG_X2,
> +                  TCG_REG_X0, -CPU_TLB_ENTRY_BITS);
> +#if TARGET_LONG_BITS == 64
> +    tcg_out_ldst(s, LDST_64, LDST_LD, TCG_REG_X3, TCG_REG_X2, 0);
> +#else
> +    tcg_out_ldst(s, LDST_32, LDST_LD, TCG_REG_X3, TCG_REG_X2, 0);
> +#endif
> +    /* check alignment */
> +    if (s_bits) {
> +        tcg_out_tst(s, 1, addr_reg, s_bits, 0);
> +        label_ptr[0] = s->code_ptr;
> +        tcg_out_goto_cond_noaddr(s, TCG_COND_NE);
> +    }
> +    tcg_out_cmp(s, 1, TCG_REG_X3, TCG_REG_TMP, -TARGET_PAGE_BITS);
> +    label_ptr[1] = s->code_ptr;
> +    tcg_out_goto_cond_noaddr(s, TCG_COND_NE);

I'm positive that the branch predictor would be happier with a single
branch rather than the two you generate here.  It ought to be possible
to use a different set of insns to do this in one go.  How about
something like

	@ extract the tlb index from the address
	ubfm	w0, addr_reg, TARGET_PAGE_BITS, CPU_TLB_BITS

	@ add any "high bits" from the tlb offset,
	@ noting that env will be much smaller than 24 bits.
	add	x1, env, tlb_offset & 0xfff000

	@ zap the tlb index from the address for compare.
	@ this is all high bits plus 0-3 low bits set, so this
	@ should match a logical immediate.
	and	w/x2, addr_reg, TARGET_PAGE_MASK | ((1 << s_bits) - 1)

	@ merge the tlb index into the env+tlb_offset
	add	x1, x1, x0, lsl #3

	@ load the tlb comparator.  the 12-bit scaled offset
	@ form will fit the bits remaining from above, given that
	@ we're loading an aligned object, and so the low 2/3 bits
	@ will be clear.
	ldr	w/x0, [x1, tlb_offset & 0xfff]

	@ load the tlb addend.  do this early to avoid stalling.
	@ the addend_offset differs from tlb_offset by 1-3 words.
	@ given that we've got overlap between the scaled 12-bit
	@ value and the 12-bit shifted value above, this also ought
	@ to always be representable.
	ldr	x3, [x1, (tlb_offset & 0xfff) + (addend_offset - tlb_offset)]

	@ perform the comparison
	cmp	w/x0, w/x2

	@ generate the complete host address in parallel with the cmp.
	add	x3, x3, addr_reg		@ 64-bit guest
	add	x3, x3, addr_reg, uxtw		@ 32-bit guest

	bne	miss_label

Note that the w/x above indicates the ext setting that ought to be
used, depending on the address size of the guest.  This is at least
2 insns shorter than your sequence.
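Spelled out as emission code, for concreteness, that would be roughly
the function below.  This is an untested sketch: tcg_out_ubfm,
tcg_out_addi and tcg_out_and_mask are stand-in names for encoders that
don't exist yet, addend_offset is symbolic as in the assembly above,
and the remaining calls reuse the signatures from your patch.

    static void tcg_out_tlb_read(TCGContext *s, TCGReg addr_reg, int s_bits,
                                 uint8_t **label_ptr, int tlb_offset)
    {
        int ext = (TARGET_LONG_BITS == 64);

        /* ubfm w0, addr_reg, TARGET_PAGE_BITS, CPU_TLB_BITS */
        tcg_out_ubfm(s, 0, TCG_REG_X0, addr_reg, TARGET_PAGE_BITS,
                     TARGET_PAGE_BITS + CPU_TLB_BITS - 1);

        /* add x1, env, tlb_offset & 0xfff000 */
        tcg_out_addi(s, 1, TCG_REG_X1, TCG_AREG0, tlb_offset & 0xfff000);

        /* and w/x2, addr_reg, TARGET_PAGE_MASK | ((1 << s_bits) - 1) */
        tcg_out_and_mask(s, ext, TCG_REG_X2, addr_reg,
                         TARGET_PAGE_MASK | ((1 << s_bits) - 1));

        /* add x1, x1, x0, lsl #CPU_TLB_ENTRY_BITS */
        tcg_out_arith(s, ARITH_ADD, 1, TCG_REG_X1, TCG_REG_X1,
                      TCG_REG_X0, -CPU_TLB_ENTRY_BITS);

        /* ldr w/x0, [x1, tlb_offset & 0xfff] -- the comparator */
        tcg_out_ldst(s, ext ? LDST_64 : LDST_32, LDST_LD,
                     TCG_REG_X0, TCG_REG_X1, tlb_offset & 0xfff);

        /* ldr x3, [x1, ...] -- the addend, loaded early */
        tcg_out_ldst(s, LDST_64, LDST_LD, TCG_REG_X3, TCG_REG_X1,
                     (tlb_offset & 0xfff) + (addend_offset - tlb_offset));

        /* cmp w/x0, w/x2 */
        tcg_out_cmp(s, ext, TCG_REG_X0, TCG_REG_X2, 0);

        /* add x3, x3, addr_reg -- host address, in parallel with the
           cmp; the 32-bit guest case wants the uxtw form instead */
        tcg_out_arith(s, ARITH_ADD, 1, TCG_REG_X3, TCG_REG_X3, addr_reg, 0);

        /* bne miss_label -- the one branch */
        label_ptr[0] = s->code_ptr;
        tcg_out_goto_cond_noaddr(s, TCG_COND_NE);
    }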
Have you looked at doing the out-of-line tlb miss sequence right from
the very beginning?  It's not that much more difficult to accomplish
than the inline tlb miss.  See CONFIG_QEMU_LDST_OPTIMIZATION, and the
implementation in tcg/arm.  You won't need two nops after the call;
aarch64 can do all the required extensions and data movement operations
in a single insn.
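To give you the shape of it -- paraphrased from memory, so take the
struct fields and names as approximate rather than the real
declarations -- the arm scheme amounts to queuing one record per
qemu_ld/st as it's emitted:

    typedef struct TCGLabelQemuLdst {
        bool is_ld;           /* load or store */
        int opc;              /* size and signedness of the access */
        int mem_index;        /* softmmu mmu_idx */
        TCGReg addr_reg;      /* guest virtual address */
        TCGReg data_reg;      /* value in (store) or out (load) */
        uint8_t *raddr;       /* fast-path address to return to */
        uint8_t *label_ptr;   /* the bne from tcg_out_tlb_read */
    } TCGLabelQemuLdst;

    /* Called from tcg_out_qemu_ld/st right after the fast path.  */
    static void add_qemu_ldst_label(TCGContext *s, bool is_ld, int opc,
                                    TCGReg data_reg, TCGReg addr_reg,
                                    int mem_index, uint8_t *raddr,
                                    uint8_t *label_ptr)
    {
        TCGLabelQemuLdst *l = &s->qemu_ldst_labels[s->nb_qemu_ldst_labels++];

        l->is_ld = is_ld;
        l->opc = opc;
        l->mem_index = mem_index;
        l->addr_reg = addr_reg;
        l->data_reg = data_reg;
        l->raddr = raddr;
        l->label_ptr = label_ptr;
    }

The finalize hook then walks that array once the TB body is done: for
each record, patch the bne to the current code_ptr, emit the call to
the ld/st helper, move or extend the result into data_reg (a single
insn each on aarch64, which is where the nops go away), and branch
back to raddr.


r~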