From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-3.5 required=3.0 tests=DKIM_ADSP_CUSTOM_MED, DKIM_INVALID,DKIM_SIGNED,FREEMAIL_FORGED_FROMDOMAIN,FREEMAIL_FROM, HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI,SIGNED_OFF_BY,SPF_HELO_NONE, SPF_PASS,URIBL_BLOCKED autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 2B9DAC35DEE for ; Tue, 25 Feb 2020 03:07:53 +0000 (UTC) Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPS id D85AB22525 for ; Tue, 25 Feb 2020 03:07:52 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=fail reason="signature verification failed" (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="Qe7eGmzv" DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org D85AB22525 Authentication-Results: mail.kernel.org; dmarc=fail (p=none dis=none) header.from=gmail.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Received: from localhost ([::1]:48146 helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1j6QZc-00007L-1X for qemu-devel@archiver.kernel.org; Mon, 24 Feb 2020 22:07:52 -0500 Received: from eggs.gnu.org ([2001:470:142:3::10]:45487) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1j6QYx-00081v-3X for qemu-devel@nongnu.org; Mon, 24 Feb 2020 22:07:12 -0500 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1j6QYv-0002H5-7z for qemu-devel@nongnu.org; Mon, 24 Feb 2020 22:07:11 -0500 Received: from mail-yw1-xc44.google.com ([2607:f8b0:4864:20::c44]:35519) by eggs.gnu.org with esmtps (TLS1.0:RSA_AES_128_CBC_SHA1:16) (Exim 4.71) (envelope-from ) id 1j6QYv-0002G8-2n; Mon, 24 Feb 2020 22:07:09 -0500 Received: by mail-yw1-xc44.google.com with SMTP id i190so6447916ywc.2; Mon, 24 Feb 2020 19:07:08 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:subject:from:in-reply-to:date:cc :content-transfer-encoding:message-id:references:to; bh=NbUeFaSg0C5X62e8QRnrtiyK1p6W25fLbDyd7UPOuFc=; b=Qe7eGmzvo/OcblWJAG5VioPq9Xg5qMjRmI8TfP33s84vH2mE9heILL9ZME/W63hAma V/WNoOrTDTBWQceXkD9Ju/y+7/keX5cCE4Jci+v8Bl50BzWtR6Hi4QGFiWKy/EvLVXj5 F65KRSSDZhuszFxeZZQ+HNpXppR0EXASI0LmebsgBkBNDGU8XlTaIQOAV7saSIrLZb0L T+gZDmGoF9WPUvXx6RnvuNXkfp00m2bu0bt40ltKImgH09oKt3bJYJJxYF/qyWsgIu6h /PtmUoNwgnGW2VWnbgOlYAwPkZlEugHaXzngnA4kEPKhj/rTLDzMA0YtEGTvH5qzOLKN Dm/g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:subject:from:in-reply-to:date:cc :content-transfer-encoding:message-id:references:to; bh=NbUeFaSg0C5X62e8QRnrtiyK1p6W25fLbDyd7UPOuFc=; b=M4Uz8dnNxS0Pirutl3bS+sn0Q7QTY2Zs0Brub+7AEmbPTgwLpZXTeMsr0UTHDUaa32 SQqP3Dp5dZAjwRiO8c7RABrvVtokNRI0AjrRVwIyl+8u1FQO+6BvcGDbfVIGaqGzdeEY IiScZKElerfNEjwb9KU47L6u+ZAVFAZ2WRbFeptRPwbcARqrnBWI8fsGaxTi18PCB4F5 fVJsLc13lpQFafnOVuwRdTFHdMTc1LF5LzUjM8DuW3EVPcOD/5WtUzLw/N2Pm8j3h/gw 1GkV9iaxDlWdn3O6J3G8/qawf1XUP2rZk6Wk/x4SjW+3Jm03n7zx8mZvPzDumYzo5cej 2DAg== X-Gm-Message-State: APjAAAXIIMUGUqsCnH5Q3BdR9P2uT499FhpPNVanbJvXnsfRwdmI31Sg KKLGmxYXGEIE5izx/ATCmgo= X-Google-Smtp-Source: APXvYqygGvlU6FzlTekjT+il2Vbs2+1eohlfLaep+CFHHc/RKwqNuVYDOrzk35kXjILsdDJcX/vKlg== X-Received: by 2002:a81:f4b:: with SMTP id 72mr44254770ywp.251.1582600027856; Mon, 24 Feb 2020 19:07:07 -0800 (PST) Received: from [192.168.0.5] (d149-67-30-58.try.wideopenwest.com. [67.149.58.30]) by smtp.gmail.com with ESMTPSA id a74sm5764242ywe.42.2020.02.24.19.07.06 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Mon, 24 Feb 2020 19:07:06 -0800 (PST) Content-Type: text/plain; charset=us-ascii Mime-Version: 1.0 (Mac OS X Mail 10.3 \(3273\)) Subject: Re: [RFC PATCH v2] target/ppc: Enable hardfloat for PPC From: Programmingkid In-Reply-To: Date: Mon, 24 Feb 2020 22:07:05 -0500 Content-Transfer-Encoding: quoted-printable Message-Id: <1BC2E9E9-A694-4ED3-BD3D-D731F23B7245@gmail.com> References: <20200218171702.979F074637D@zero.eik.bme.hu> To: BALATON Zoltan X-Mailer: Apple Mail (2.3273) X-detected-operating-system: by eggs.gnu.org: Genre and OS details not recognized. X-Received-From: 2607:f8b0:4864:20::c44 X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Cc: Howard Spoelstra , qemu-ppc@nongnu.org, qemu-devel qemu-devel , David Gibson , Paul Clarke Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: "Qemu-devel" > On Feb 19, 2020, at 10:35 AM, BALATON Zoltan = wrote: >=20 > Hello, >=20 > On Tue, 18 Feb 2020, Programmingkid wrote: >>> On Feb 18, 2020, at 12:10 PM, BALATON Zoltan = wrote: >>> While other targets take advantage of using host FPU to do floating >>> point computations, this was disabled for PPC target because always >>> clearing exception flags before every FP op made it slightly slower >>> than emulating everyting with softfloat. To emulate some FPSCR bits, >>> clearing of fp_status may be necessary (unless these could be = handled >>> e.g. using FP exceptions on host but there's no API for that in QEMU >>> yet) but preserving at least the inexact flag makes hardfloat usable >>> and faster than softfloat. Since most clients don't actually care >>> about this flag, we can gain some speed trading some emulation >>> accuracy. >>>=20 >>> This patch implements a simple way to keep the inexact flag set for >>> hardfloat while still allowing to revert to softfloat for workloads >>> that need more accurate albeit slower emulation. (Set hardfloat >>> property of CPU, i.e. -cpu name,hardfloat=3Dfalse for that.) There = may >>> still be room for further improvement but this seems to increase >>> floating point performance. Unfortunately the softfloat case is = slower >>> than before this patch so this patch only makes sense if the default >>> is also set to enable hardfloat. >>>=20 >>> Because of the above this patch at the moment is mainly for testing >>> different workloads to evaluate how viable would this be in = practice. >>> Thus, RFC and not ready for merge yet. >>>=20 >>> Signed-off-by: BALATON Zoltan >>> --- >>> v2: use different approach to avoid needing if () in >>> helper_reset_fpstatus() but this does not seem to change overhead >>> much, also make it a single patch as adding the hardfloat option is >>> only a few lines; with this we can use same value at other places = where >>> float_status is reset and maybe enable hardfloat for a few more = places >>> for a little more performance but not too much. With this I got: >>=20 >> >>=20 >> Thank you for working on this. It is about time we have a better FPU. >=20 > Thank you for testing it. I think it would be great if we could come = up with some viable approach to improve this before the next freeze. >=20 >> I applied your patch over David Gibson's ppc-for-5.0 branch. It = applied cleanly and compiled easily. >=20 > I've heard some preliminary results from others that there's also a = difference between v1 and v2 of the patch in performance where v1 may be = faster for same cases so if you (or someone else) want and have time you = could experiment with different versions and combinations as well to = find the one that's best on all CPUs. Basically we have these parts: >=20 > 1. Change target/ppc/fpu_helper.c::helper_reset_fpstatus() to force = float_flag_inexact on in case hadfloat is enabled, I've tried two = approaches for this: >=20 > a. In v1 added an if () in the function > b. In v2 used a variable from env set earlier (I've hoped this may be = faster but maybe it's not, testing and explanation is welcome) >=20 > 2. Also change places where env->fp_status is copied to a local tstat = and then reset (I think this is done to accumulate flags from multiple = FP ops that would individually reset env->fp_status or some other = reason, maybe this could be avoided if we reset fp_status less often but = that would need more understanding of the FP emulation that I don't have = so I did not try to clean that up yet). >=20 > If v2 is really slower than v1 then I'm not sure is it because also = changing places with tstat or because of the different approach in = helper_reset_fpstatus() so you could try combinations of these as well. >=20 >> Tests were done on a Mac OS 10.4.3 VM. The CPU was set to G3. >=20 > What was the host CPU and OS this was tested on? Please always share = CPU info and host OS when sharing bechmark results so they are somewhat = comparable. It also depends on CPU features for vector instrucions at = least so without CPU info the results could not be understood. Intel Core i5-2500S CPU @ 2.70GHz. > I think G3 does not have AltiVec/VMX so maybe testing with G4 would be = better to also test those ops unless there's a reason to only test G3. = I've tested with G4 both FPU only and FPU+VMX code on Linux host with = i7-9700K CPU @ 3.60GHz as was noted in the original cover letter but may = be I'va also forgotten some details so I list it here again. Ok, I did test on the G4, here are my results: Git commit: c1e667d2598b9b3ce62b8e89ed22dd38dfe9f57f Mac OS 10.4.3 VM -cpu G4 -USB audio device Hardfloat=3Dfalse Audio sounds bad when playing midi file. Extraction rate: 1.5x Converting rate: 0.7x Total time: 7:24 Hardfloat=3Dtrue Midi audio sounded perfect for about 30 seconds, then it went silent! Extraction rate: 1.4x (slower with hard float) Converting rate: 0.7x (same as without hardfloat) Total time: 7:16 (faster time with hardfloat) >=20 >> I did several tests and here are my results: >>=20 >> With hard float: >> - The USB audio device does not produce any sound. >=20 > I've heard this could also be due to some other problem not directly = related to FPU, maybe there's a problem with USB/OHCI emulation as well = because problems with that were reported but it's interesting why you = get different results changing FPU related stuff. I think OSX uses float = samples so probably does use FPU for processing sound and may rely on = some pecularity of the hardware as it was probably optimised for Apple = hardware. It would be interesting to find out how FPU stuff is related = to this but since it's broken anyway probably not a show stopper at the = moment. When I played sound this second time I hard the same broken audio I = usually hear with the USB audio device with hardfloat set to false. When = playing the same midi file with hardfloat set to true, the audio played = perfectly! It only played for 30 seconds before it went silent. >=20 >> - Converting a MIDI file to AAC in iTunes happens at 0.4x (faster = than soft float :) ). >=20 > Does resulting file match? As a simple test I've verified md5sum of = the resulting mp3 with the lame benchmark I've tried just to find any = big errors. Even if it does not prove that nothing broke, it shuold = detect if something breaks badly. However that was WAV->MP3 where = results were same, although the VMX build did produce different result = than FPU only but did so consistently for multiple runs. With MIDI there = might be slight timing difference that could cause different audio = results so you should first verify if doing the conversion multiple = times does produce the same result at all without any patch first. The md5 number for each file does match. Note this time I converted a = midi file to MP3. >=20 >> For my FPSCR test program, 21 tests failed. The high number is = because the inexact exception is being set for situations it should not = be set for. >=20 > Since we force the inexact flag to be set to enable hardfloat this is = expected. More interesting is if apart from this are there any = difference in the results compared to the soffloat case (that may also = be host CPU dependent I think). Do you have more detailed info on the = errors and differences found? I can give you the full testing suite if you like. I run it on Mac OS = 10.4 but it should compile with gcc on Linux. I will send it to you in a = separate email because it is big. >> With soft float: - Some sound can be heard from the USB audio device. = It isn't good sounding. I had to force quit Quicktime player because it = stopped working. >> - Converting a MIDI file to AAC in iTunes happens at 0.3x (slower = than hard float). >> - 13 tests failed with my FPSCR test program. >>=20 >> This patch is a good start. I'm not worried about the Floating Point = Status and Control Register flags being wrong since hardly any software = bothers to check them. I think more optimizations can happen by >=20 > I don't know if guest code checks fpscr and what flags it cares about. = Also don't know if it's a fact that these are not used but maybe if we = test with more guest codes we can find out. That's why I'd like to at = least have an option to test with hardfloat. Unfortunately enabling = hardfloat without also making it default would make it slower so if we = go this way we should make sure we can also enable hardfloat as default. >=20 >> simplifying the FPU. As it is now it makes a lot of calls per = operation. >=20 > Question is if those calls are really needed to emulate PPC FPU or if = not why would they be there? If the FPU is really that much different so = all these calls are needed then there's not much to simplify (but maybe = there could be some optimisations possible). This would need someone to = understand the current code in full first that probably we don't = currently (ar least I don't for sure so can't really make changes = either). Another more viable approach is to pick a small part and follow = through with that and try to clean up and optimise that small part only. = The exception and fpscr handling is one such part, another could be = round_canonical() that seems to be high on profiles I've taken. Maybe = this could be done by reading and understading docs only on the small = part picked that may be easier than getting everything first. I wonder = if such smaller tasks could be defined and given out as GSoC or other = volunteer projects? I have another idea on how to improve QEMU's performance. What if you = enabled more CPUs for the PowerPC target? Mac OS 9, Mac OS X, and Linux = support multiple CPUs. It might actually be easier to do this than to = improve the FPU. I imagine the performance increase with multiple = emulated CPUs would be much more noticeable.=20