From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: ** X-Spam-Status: No, score=2.7 required=3.0 tests=DKIM_ADSP_CUSTOM_MED, DKIM_INVALID,DKIM_SIGNED,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,HTML_MESSAGE,MAILING_LIST_MULTI,SPF_HELO_NONE, SPF_PASS,URIBL_BLOCKED autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 455CBC47253 for ; Fri, 1 May 2020 16:31:11 +0000 (UTC) Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id ED02A20836 for ; Fri, 1 May 2020 16:31:10 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="NhpkBe5C" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org ED02A20836 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=gmail.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Received: from localhost ([::1]:45364 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1jUYZC-0007St-4h for qemu-devel@archiver.kernel.org; Fri, 01 May 2020 12:31:10 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]:45988) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1jUYXt-0006KH-KF for qemu-devel@nongnu.org; Fri, 01 May 2020 12:29:50 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.90_1) (envelope-from ) id 1jUYXs-0006bN-AA for qemu-devel@nongnu.org; Fri, 01 May 2020 12:29:49 -0400 Received: from mail-lj1-x232.google.com 
([2a00:1450:4864:20::232]:44077) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_128_GCM_SHA256:128) (Exim 4.90_1) (envelope-from ) id 1jUYXr-0006Wu-T1; Fri, 01 May 2020 12:29:47 -0400 Received: by mail-lj1-x232.google.com with SMTP id a21so3056444ljj.11; Fri, 01 May 2020 09:29:47 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:reply-to:from:date:message-id :subject:to:cc; bh=FoQYJ2DkJfhvLezTbgCOOb3XRTDlhYslVQkhultewgY=; b=NhpkBe5C++Rm/u+OkOOpQLPt7CL2TK9kVlM7DU2HTdbqSW0THPxlF8f3vqIssQCafI hxKFX1CIVOXCtakpKCtqr+AVC3fvbvnDanmAF0ZbphkXO8Lw8GlBt1WcH0p1MhlY0YQ1 b0Jr73IItCVmGDVnTe1kcUQKKlsAfsNZH4ilWfM0W+1jakuDLz3JsTvKXy554cZhJ6qH CE+nIcKyDR8kTF4YH2BXrXs9hK5hnr2wif65GbisDNLKBa2W/7C8vosHuKw/ifaig3o8 ieG655buepkIvWQtpkxzhjIXQR9XOijTu1/hPyjTGRAYpoYJ/btkRCg3GJlakeWT/fPJ gfgg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:reply-to :from:date:message-id:subject:to:cc; bh=FoQYJ2DkJfhvLezTbgCOOb3XRTDlhYslVQkhultewgY=; b=JjPEt/++6EL7vWQ/zh1PYRYsuTKVCmViRpWT1uuGJhcg7I9MCZ6lmIqoIdQQfCKlOo K7Lbxgow1KE+lhE21jQhRpNClIH4crLUIixITbuAHSSSJT02b31yjjEskRJIgAS4HBVQ lMIBBq4NKfet0zK6cf5mn+oceTpkpAbS2bSXx1ZpvbFIkeLmAw7UnlHjg1PASTOE99sF Xgzs1/UYzMvnf5bp1ZAzrZiO1Era0zr1fkWF3memqhaBnG8R+eBGv9FFk9odV73oA11U rTeMcdbMzH+eg8c096Gug/GEyHzo3YqkV5h6xItlQ1RQh6Jqf4kYg5XEjgiic29m3iik CpGQ== X-Gm-Message-State: AGi0PuZvyJP/QGbyEgGwu91BfmKLjYxHy54VzHYlFj49/1ITXUyr+bFg s4ajG8EJCp37Li1ar5mlHDt8gKjxHHym+ZjYgtI= X-Google-Smtp-Source: APiQypIW8qtT0Kt7jSodMERJLXAwUrieV0Z7awcRKJHAzIhHNsBV+gw2pKHQYQIl7c8hLnuCu2Q3/rmNMGWXCP/HDV4= X-Received: by 2002:a2e:b17a:: with SMTP id a26mr2662205ljm.215.1588350585594; Fri, 01 May 2020 09:29:45 -0700 (PDT) MIME-Version: 1.0 References: <87ftcoknvu.fsf@linaro.org> <871ro6ld2f.fsf@linaro.org> <87sggmjgit.fsf@linaro.org> <43ac337c-752a-7151-1e88-de01949571de@linaro.org> <874kszkdhm.fsf@linaro.org> 
In-Reply-To:
From: 罗勇刚(Yonggang Luo)
Date: Sat, 2 May 2020 00:29:33 +0800
Message-ID:
Subject: Re: About hardfloat in ppc
To: Richard Henderson
Content-Type: multipart/alternative; boundary="0000000000006ee9a105a498b0bb"
Received-SPF: pass client-ip=2a00:1450:4864:20::232; envelope-from=luoyonggang@gmail.com; helo=mail-lj1-x232.google.com
X-detected-operating-system: by eggs.gnu.org: Error: [-] PROGRAM ABORT : Malformed IPv6 address (bad octet value). Location : parse_addr6(), p0f-client.c:67
X-Received-From: 2a00:1450:4864:20::232
X-BeenThere: qemu-devel@nongnu.org
X-Mailman-Version: 2.1.23
Precedence: list
List-Id:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
Reply-To: luoyonggang@gmail.com
Cc: Dino Papararo , "qemu-devel@nongnu.org" , Programmingkid , "qemu-ppc@nongnu.org" , Howard Spoelstra , Alex Bennée
Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org
Sender: "Qemu-devel"

--0000000000006ee9a105a498b0bb
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

On Fri, May 1, 2020 at 10:18 PM Richard Henderson <richard.henderson@linaro.org> wrote:

> On 5/1/20 6:10 AM, Alex Bennée wrote:
> >
> > 罗勇刚(Yonggang Luo) <luoyonggang@gmail.com> writes:
> >
> >> On Fri, May 1, 2020 at 7:58 PM BALATON Zoltan <balaton@eik.bme.hu> wrote:
> >>
> >>> On Fri, 1 May 2020, 罗勇刚(Yonggang Luo) wrote:
> >>>> That's what I suggested,
> >>>> We preserve a float computing cache
> >>>> typedef struct FpRecord {
> >>>>   uint8_t op;
> >>>>   float32 A;
> >>>>   float32 B;
> >>>> } FpRecord;
> >>>> FpRecord fp_cache[1024];
> >>>> int fp_cache_length;
> >>>> uint32_t fp_exceptions;
> >>>>
> >>>> 1. For each new fp operation we push it to the fp_cache,
> >>>> 2. Once we read the fp_exceptions , then we re-compute
> >>>> the fp_exceptions by re-running the fp FpRecord sequence.
> >>>> and clear fp_cache_length.
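
For readers following the thread, the cache proposal quoted above can be sketched in C roughly as follows. This is a minimal illustration under stated assumptions, not QEMU code: the op codes, flag bits and function names are hypothetical, plain float stands in for float32, flushing when the cache fills is a guess at the intended behaviour, and only a few exception cases are modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical op codes and flag bits, for illustration only. */
enum { FP_OP_ADD, FP_OP_DIV };
enum { FP_FLAG_INVALID = 1u << 0, FP_FLAG_DIVBYZERO = 1u << 1 };

typedef struct FpRecord {
    uint8_t op;
    float a;    /* plain float stands in for float32 */
    float b;
} FpRecord;

#define FP_CACHE_SIZE 1024

static FpRecord fp_cache[FP_CACHE_SIZE];
static int fp_cache_length;
static uint32_t fp_exceptions;    /* sticky flags folded in so far */

static int is_nan(float x) { return x != x; }
static int is_finite(float x) { return (x - x) == 0.0f; }

/* Recompute the flags that one recorded operation raises.
 * Only a few IEEE 754 cases are modelled here. */
static uint32_t fp_replay_one(const FpRecord *r)
{
    uint32_t flags = 0;

    switch (r->op) {
    case FP_OP_ADD:
        /* Non-NaN inputs giving NaN means infinities of opposite sign. */
        if (!is_nan(r->a) && !is_nan(r->b) && is_nan(r->a + r->b)) {
            flags |= FP_FLAG_INVALID;
        }
        break;
    case FP_OP_DIV:
        if (r->a == 0.0f && r->b == 0.0f) {
            flags |= FP_FLAG_INVALID;      /* 0 / 0 */
        } else if (r->b == 0.0f && is_finite(r->a)) {
            flags |= FP_FLAG_DIVBYZERO;    /* finite nonzero / 0 */
        }
        break;
    }
    return flags;
}

/* Point 2 of the proposal: replay the records, fold the result into
 * the sticky flags, and clear fp_cache_length. */
void fp_cache_flush(void)
{
    for (int i = 0; i < fp_cache_length; i++) {
        fp_exceptions |= fp_replay_one(&fp_cache[i]);
    }
    fp_cache_length = 0;
}

/* Point 1 of the proposal: record each new fp operation.
 * Flushing when the cache is full is an assumption of this sketch. */
void fp_cache_push(uint8_t op, float a, float b)
{
    if (fp_cache_length == FP_CACHE_SIZE) {
        fp_cache_flush();
    }
    assert(fp_cache_length < FP_CACHE_SIZE);
    fp_cache[fp_cache_length++] = (FpRecord){ op, a, b };
}

/* Reading the exceptions forces the deferred recomputation. */
uint32_t fp_read_exceptions(void)
{
    fp_cache_flush();
    return fp_exceptions;
}
```

The cost of each operation is a single store into the array; all flag work is deferred to fp_read_exceptions(), which is the trade-off the proposal relies on.
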
> >>>
> >>> Why do you need to store more than the last fp op? The cumulative bits can
> >>> be tracked like it's done for other targets by not clearing fp_status then
> >>> you can read it from there. Only the non-sticky FI bit needs to be
> >>> computed but that's only determined by the last op so it's enough to
> >>> remember that and run that with softfloat (or even hardfloat after
> >>> clearing status but softfloat may be faster for this) to get the bits for
> >>> last op when status is read.
> >>>
> >> Yep, storing only the last fp op is also an option. Do you mean that we
> >> store the last fp op and recompute it when necessary? I am thinking about
> >> a general fp optimization method that suits all targets.
> >
> > I think that's getting a little ahead of yourself. Let's prove the
> > technique is valuable for PPC (given it has the most to gain). We can
> > always generalise later if it's worthwhile.
>
> Indeed.
>
> > Rather than creating a new structure I would suggest creating 3 new tcg
> > globals (op, inA, inB) and re-factor the front-end code so each FP op
> > loaded the TCG globals. The TCG optimizer should pick up aliased loads
> > and automatically eliminate the dead ones. We might need some new
> > machinery for the TCG to avoid spilling the values over potentially
> > faulting loads/stores but that is likely a phase 2 problem.
>
> There's no point in new tcg globals.
>
> Every fp operation can raise an exception, and therefore every fp operation
> will flush tcg globals to memory. Therefore there is no optimization to be
> done at the tcg opcode level.
>
> However, every fp operation calls a helper function, and the quickest thing to
> do is store the inputs to env->(op, inA, inB, inC) in the helper before
> performing the operation.
>
>
> > Next you will want to find places that care about the per-op bits of
> > cpu_fpscr and call a helper with the new globals to re-run the
> > computation and feed the values in.
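
The suggestion quoted above (store the inputs in env inside the helper, then re-run the operation only when the per-op bits are read) could look roughly like the sketch below. FpEnvSketch, helper_fmul_sketch and fpscr_read_fi_sketch are hypothetical names, not QEMU's; only multiplication is handled, and the inexact check assumes the host evaluates float expressions in float precision with round-to-nearest.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { FP_OP_NONE, FP_OP_MUL };

/* Hypothetical stand-in for the CPU state; not QEMU's CPUPPCState. */
typedef struct FpEnvSketch {
    uint8_t op;
    float in_a;
    float in_b;
} FpEnvSketch;

/* Helper: remember the inputs, then do the fast host multiply. */
float helper_fmul_sketch(FpEnvSketch *env, float a, float b)
{
    assert(env != NULL);
    env->op = FP_OP_MUL;
    env->in_a = a;
    env->in_b = b;
    return a * b;
}

/* Called only when the guest reads the per-op inexact bit: re-run the
 * last operation and report whether its result had to be rounded. */
int fpscr_read_fi_sketch(const FpEnvSketch *env)
{
    assert(env != NULL);
    if (env->op != FP_OP_MUL) {
        return 0;
    }

    /* volatile forces the product to be rounded to float precision. */
    volatile float rounded = env->in_a * env->in_b;
    float r = rounded;

    /* Two 24-bit significands multiply into at most 48 bits, which a
     * double (53 bits) holds exactly, so this is the unrounded result. */
    double exact = (double)env->in_a * (double)env->in_b;

    if (r != r) {
        return 0;    /* NaN result: invalid or propagated, not inexact */
    }
    return (double)r != exact;
}
```

The helper pays for three stores per operation, and the recomputation cost is paid only on the (presumably rare) path that reads the status bits.
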
>
> Before we even get to this deferred fp operation thing, there are several giant
> improvements to ppc emulation that can be made:
>
> Step 1 is to rearrange the fp helpers to eliminate helper_reset_fpstatus().
> I've mentioned this before, that it's possible to leave the steady-state of
> env->fp_status.exception_flags == 0, so there's no need for a separate function
> call. I suspect this is worth a decent speedup by itself.
>

Hi Richard, what kind of rearrangement of the fp helpers needs to be done?
Can you give me a more detailed example? I still don't get the idea.

>
> Step 2 is to notice when all fp exceptions are masked, so that no exception can
> be raised, and set a tb_flags bit. This is the default fp environment that
> libc enables and therefore extremely common.
>
> Currently, ppc has 3 helpers called per fp operation. If step 1 is handled
> correctly, then we're down to 2 fp helpers per fp operation. If no exceptions
> need raising, then we can perform the entire operation with a single function
> call.
>
> We would require a parallel set of fp helpers that (1) performs the operation
> and (2) does any post-processing of the exception bits straight away, but (3)
> without raising any exceptions. Sort of like helper_fadd +
> do_float_check_status, but less. IIRC the only real extra work is categorizing
> invalid exceptions. We could even plausibly extend softfloat to do that while
> it is recording the invalid exception.
>
> Step 3 is to improve softfloat.c with Yonggang Luo's idea to compute inexact
> from the inverse hardfloat operation. This would let us relax the restriction
> of only using hardfloat when we already have an accrued inexact exception.
>
> Only after all of these are done is it worth experimenting with caching the
> last fp operation.
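
One plausible reading of Step 3 (deriving inexact from the inverse host operation) is sketched below. It is not necessarily what was proposed and it is not softfloat code: the function names are hypothetical, division is checked by multiplying back, addition is checked with Knuth's TwoSum error term (subtracting back), and the sketch assumes round-to-nearest and a host that evaluates float expressions in float precision.

```c
#include <assert.h>

static int is_finite_f(float x) { return (x - x) == 0.0f; }

/* Division: the inverse operation is multiplication. q * b is exact in
 * double, so it equals a only if the float quotient was not rounded. */
int fdiv_is_inexact(float a, float b)
{
    if (!is_finite_f(a) || !is_finite_f(b) || b == 0.0f) {
        return 0;    /* NaN, infinity and x/0 do not set inexact */
    }
    volatile float q = a / b;    /* volatile forces rounding to float */
    float qf = q;
    return (double)qf * (double)b != (double)a;
}

/* Addition: the inverse operation is subtraction. The TwoSum error
 * term is zero exactly when a + b needed no rounding. */
int fadd_is_inexact(float a, float b)
{
    if (!is_finite_f(a) || !is_finite_f(b)) {
        return 0;
    }
    volatile float s = a + b;
    float sf = s;
    if (!is_finite_f(sf)) {
        return 1;    /* overflow also sets inexact */
    }
    volatile float bv = sf - a;    /* part of the sum that came from b */
    volatile float av = sf - bv;   /* part of the sum that came from a */
    volatile float ea = a - av;    /* what was lost from a */
    volatile float eb = b - bv;    /* what was lost from b */
    return (ea + eb) != 0.0f;
}

/* Small self-check showing the intended use. */
void fp_inexact_selfcheck(void)
{
    assert(!fdiv_is_inexact(1.0f, 4.0f));
    assert(fdiv_is_inexact(1.0f, 3.0f));
    assert(!fadd_is_inexact(0.5f, 0.25f));
    assert(fadd_is_inexact(16777216.0f, 1.0f));    /* 2^24 + 1 */
}
```

Both checks use only a handful of host fp instructions, which is why this route might allow the fast path to be taken even when the inexact flag is not already set.
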
>
>
> r~
>

-- 
Yours sincerely,
罗勇刚 (Yonggang Luo)

--0000000000006ee9a105a498b0bb
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable


--0000000000006ee9a105a498b0bb--