From: Richard Henderson
Date: Fri, 31 May 2013 13:25:26 -0700
Message-ID: <51A90736.4090005@twiddle.net>
In-Reply-To: <51A8E6E2.1010704@huawei.com>
Subject: Re: [Qemu-devel] [PATCH 4/4] tcg/aarch64: implement tlb lookup fast path
To: Jani Kokkonen
Cc: Laurent Desnogues, Peter Maydell, Claudio Fontana, qemu-devel@nongnu.org

On 05/31/2013 11:07 AM, Jani Kokkonen wrote:
> +/* Load and compare a TLB entry, leaving the flags set.  Leaves X2 pointing
> +   to the tlb entry.  Clobbers X0,X1,X2,X3 and TMP.  */
> +
> +static void tcg_out_tlb_read(TCGContext *s, TCGReg addr_reg,
> +                             int s_bits, uint8_t **label_ptr, int tlb_offset)
> +{

You copied the comment from ARM, and it isn't correct.  You generate
branches.

> +    TCGReg base = TCG_AREG0;
> +
> +    tcg_out_shr(s, 1, TCG_REG_TMP, addr_reg, TARGET_PAGE_BITS);
> +    tcg_out_movi(s, TCG_TYPE_I64, TCG_REG_X1, tlb_offset);
> +    tcg_out_arith(s, ARITH_ADD, 1, TCG_REG_X2, base, TCG_REG_X1, 0);
> +    tcg_out_andi(s, 1, TCG_REG_X0, TCG_REG_TMP, CPU_TLB_BITS, 0);
> +    tcg_out_arith(s, ARITH_ADD, 1, TCG_REG_X2, TCG_REG_X2,
> +                  TCG_REG_X0, -CPU_TLB_ENTRY_BITS);
> +#if TARGET_LONG_BITS == 64
> +    tcg_out_ldst(s, LDST_64, LDST_LD, TCG_REG_X3, TCG_REG_X2, 0);
> +#else
> +    tcg_out_ldst(s, LDST_32, LDST_LD, TCG_REG_X3, TCG_REG_X2, 0);
> +#endif
> +    /* check alignment */
> +    if (s_bits) {
> +        tcg_out_tst(s, 1, addr_reg, s_bits, 0);
> +        label_ptr[0] = s->code_ptr;
> +        tcg_out_goto_cond_noaddr(s, TCG_COND_NE);
> +    }
> +    tcg_out_cmp(s, 1, TCG_REG_X3, TCG_REG_TMP, -TARGET_PAGE_BITS);
> +    label_ptr[1] = s->code_ptr;
> +    tcg_out_goto_cond_noaddr(s, TCG_COND_NE);

I'm positive that the branch predictor would be happier with a single
branch rather than the two you generate here.  It ought to be possible
to use a different set of insns to do this in one go.  How about
something like

	@ extract the tlb index from the address
	ubfm	w0, addr_reg, TARGET_PAGE_BITS, CPU_TLB_BITS

	@ add any "high bits" from the tlb offset,
	@ noting that env will be much smaller than 24 bits.
	add	x1, env, tlb_offset & 0xfff000

	@ zap the tlb index from the address for compare.
	@ this is all high bits plus 0-3 low bits set, so this
	@ should match a logical immediate.
	and	w/x2, addr_reg, TARGET_PAGE_MASK | ((1 << s_bits) - 1)

	@ merge the tlb index into the env+tlb_offset
	add	x1, x1, x0, lsl #3

	@ load the tlb comparator.  the 12-bit scaled offset
	@ form will fit the bits remaining from above, given that
	@ we're loading an aligned object, and so the low 2/3 bits
	@ will be clear.
	ldr	w/x0, [x1, tlb_offset & 0xfff]

	@ load the tlb addend.  do this early to avoid stalling.
	@ the addend_offset differs from tlb_offset by 1-3 words.
	@ given that we've got overlap between the scaled 12-bit
	@ value and the 12-bit shifted value above, this also ought
	@ to always be representable.
	ldr	x3, [x1, (tlb_offset & 0xfff) + (addend_offset - tlb_offset)]

	@ perform the comparison
	cmp	w/x0, w/x2

	@ generate the complete host address in parallel with the cmp.
	add	x3, x3, addr_reg		@ 64-bit guest
	add	x3, x3, addr_reg, uxtw		@ 32-bit guest

	bne	miss_label

Note that the w/x above indicates the ext setting that ought to be
used, depending on the address size of the guest.  This is at least
2 insns shorter than your sequence.
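Spelled out as emission code, for concreteness, that would be roughly
the function below.  This is an untested sketch: tcg_out_ubfm,
tcg_out_addi and tcg_out_and_mask are stand-in names for encoders that
don't exist yet, addend_offset is symbolic as in the assembly above,
and the remaining calls reuse the signatures from your patch.

    static void tcg_out_tlb_read(TCGContext *s, TCGReg addr_reg, int s_bits,
                                 uint8_t **label_ptr, int tlb_offset)
    {
        int ext = (TARGET_LONG_BITS == 64);

        /* ubfm w0, addr_reg, TARGET_PAGE_BITS, CPU_TLB_BITS */
        tcg_out_ubfm(s, 0, TCG_REG_X0, addr_reg, TARGET_PAGE_BITS,
                     TARGET_PAGE_BITS + CPU_TLB_BITS - 1);

        /* add x1, env, tlb_offset & 0xfff000 */
        tcg_out_addi(s, 1, TCG_REG_X1, TCG_AREG0, tlb_offset & 0xfff000);

        /* and w/x2, addr_reg, TARGET_PAGE_MASK | ((1 << s_bits) - 1) */
        tcg_out_and_mask(s, ext, TCG_REG_X2, addr_reg,
                         TARGET_PAGE_MASK | ((1 << s_bits) - 1));

        /* add x1, x1, x0, lsl #CPU_TLB_ENTRY_BITS */
        tcg_out_arith(s, ARITH_ADD, 1, TCG_REG_X1, TCG_REG_X1,
                      TCG_REG_X0, -CPU_TLB_ENTRY_BITS);

        /* ldr w/x0, [x1, tlb_offset & 0xfff] -- the comparator */
        tcg_out_ldst(s, ext ? LDST_64 : LDST_32, LDST_LD,
                     TCG_REG_X0, TCG_REG_X1, tlb_offset & 0xfff);

        /* ldr x3, [x1, ...] -- the addend, loaded early */
        tcg_out_ldst(s, LDST_64, LDST_LD, TCG_REG_X3, TCG_REG_X1,
                     (tlb_offset & 0xfff) + (addend_offset - tlb_offset));

        /* cmp w/x0, w/x2 */
        tcg_out_cmp(s, ext, TCG_REG_X0, TCG_REG_X2, 0);

        /* add x3, x3, addr_reg -- host address, in parallel with the
           cmp; the 32-bit guest case wants the uxtw form instead */
        tcg_out_arith(s, ARITH_ADD, 1, TCG_REG_X3, TCG_REG_X3, addr_reg, 0);

        /* bne miss_label -- the one branch */
        label_ptr[0] = s->code_ptr;
        tcg_out_goto_cond_noaddr(s, TCG_COND_NE);
    }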
Have you looked at doing the out-of-line tlb miss sequence right from
the very beginning?  It's not that much more difficult to accomplish
than the inline tlb miss.  See CONFIG_QEMU_LDST_OPTIMIZATION, and the
implementation in tcg/arm.  You won't need two nops after the call;
aarch64 can do all the required extensions and data movement operations
in a single insn.
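To give you the shape of it -- paraphrased from memory, so take the
struct fields and names as approximate rather than the real
declarations -- the arm scheme amounts to queuing one record per
qemu_ld/st as it's emitted:

    typedef struct TCGLabelQemuLdst {
        bool is_ld;           /* load or store */
        int opc;              /* size and signedness of the access */
        int mem_index;        /* softmmu mmu_idx */
        TCGReg addr_reg;      /* guest virtual address */
        TCGReg data_reg;      /* value in (store) or out (load) */
        uint8_t *raddr;       /* fast-path address to return to */
        uint8_t *label_ptr;   /* the bne from tcg_out_tlb_read */
    } TCGLabelQemuLdst;

    /* Called from tcg_out_qemu_ld/st right after the fast path.  */
    static void add_qemu_ldst_label(TCGContext *s, bool is_ld, int opc,
                                    TCGReg data_reg, TCGReg addr_reg,
                                    int mem_index, uint8_t *raddr,
                                    uint8_t *label_ptr)
    {
        TCGLabelQemuLdst *l = &s->qemu_ldst_labels[s->nb_qemu_ldst_labels++];

        l->is_ld = is_ld;
        l->opc = opc;
        l->mem_index = mem_index;
        l->addr_reg = addr_reg;
        l->data_reg = data_reg;
        l->raddr = raddr;
        l->label_ptr = label_ptr;
    }

The finalize hook then walks that array once the TB body is done: for
each record, patch the bne to the current code_ptr, emit the call to
the ld/st helper, move or extend the result into data_reg (a single
insn each on aarch64, which is where the nops go away), and branch
back to raddr.


r~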