From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:56134)
	by lists.gnu.org with esmtp (Exim 4.71) (envelope-from )
	id 1ZDUEm-0004P8-NX for qemu-devel@nongnu.org;
	Fri, 10 Jul 2015 05:04:54 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from ) id 1ZDUEh-00074c-NZ for qemu-devel@nongnu.org;
	Fri, 10 Jul 2015 05:04:52 -0400
Received: from mail-ig0-f180.google.com ([209.85.213.180]:35801)
	by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from )
	id 1ZDUEh-00074V-Hh for qemu-devel@nongnu.org;
	Fri, 10 Jul 2015 05:04:47 -0400
Received: by igcqs7 with SMTP id qs7so8500909igc.0 for ;
	Fri, 10 Jul 2015 02:04:47 -0700 (PDT)
MIME-Version: 1.0
In-Reply-To: <559F84C2.1090109@greensocs.com>
References: <1436516626-8322-1-git-send-email-a.rigo@virtualopensystems.com>
	<559F84C2.1090109@greensocs.com>
Date: Fri, 10 Jul 2015 11:04:46 +0200
Message-ID:
From: alvise rigo
Content-Type: text/plain; charset=UTF-8
Subject: Re: [Qemu-devel] [RFC v3 00/13] Slow-path for atomic instruction translation
List-Id:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
To: Frederic Konrad
Cc: mttcg@listserver.greensocs.com, Claudio Fontana, QEMU Developers,
	Paolo Bonzini, Jani Kokkonen, VirtualOpenSystems Technical Team,
	Alex Bennée

On Fri, Jul 10, 2015 at 10:39 AM, Frederic Konrad wrote:
> On 10/07/2015 10:23, Alvise Rigo wrote:
>>
>> This is the third iteration of the patch series; starting from PATCH 007
>> there are the changes to move the whole work to multi-threading.
>> Changes versus previous versions are at the bottom of this cover letter.
>>
>> This patch series provides an infrastructure for atomic
>> instruction implementation in QEMU, paving the way for TCG
>> multi-threading.
>> The adopted design does not rely on host atomic
>> instructions and is intended to propose a 'legacy' solution for
>> translating guest atomic instructions.
>>
>> The underlying idea is to provide new TCG instructions that guarantee
>> atomicity to some memory accesses or in general a way to define memory
>> transactions. More specifically, a new pair of TCG instructions are
>> implemented, qemu_ldlink_i32 and qemu_stcond_i32, that behave as
>> LoadLink and StoreConditional primitives (only 32 bit variant
>> implemented). In order to achieve this, a new bitmap is added to the
>> ram_list structure (always unique) which flags all memory pages that
>> could not be accessed directly through the fast-path, due to previous
>> exclusive operations. This new bitmap is coupled with a new TLB flag
>> which forces the slow-path execution. All stores which are performed
>> between an LL/SC operation by other vCPUs to the same (protected) address
>> will fail the subsequent StoreConditional.
>>
>> In theory, the provided implementation of TCG LoadLink/StoreConditional
>> can be used to properly handle atomic instructions on any architecture.
>>
>> The new slow-path is implemented such that:
>> - the LoadLink behaves as a normal load slow-path, except for cleaning
>>   the dirty flag in the bitmap. The TLB entries created from now on will
>>   force the slow-path. To ensure it, we flush the TLB cache for the
>>   other vCPUs. The vCPU also sets into a private variable the accessed
>>   address, in order to make it visible to the other vCPUs
>> - the StoreConditional behaves as a normal store slow-path, except for
>>   checking whether other vCPUs have set the same exclusive address
>>
>> All those write accesses that are forced to follow the 'legacy'
>> slow-path will set the accessed memory page to dirty.
>>
>> In this series only the ARM ldrex/strex instructions are implemented
>> for ARM and i386 hosts.
>> The code has been tested with bare-metal test cases and by booting Linux,
>> using the latest mttcg QEMU branch available at
>> http://git.greensocs.com/fkonrad/mttcg.git.
>
> branch multi_tcg_v6 at this time.
>
>>
>> * Performance considerations
>> This implementation shows good results while booting a Linux kernel,
>> where tons of flushes affect the overall performance. A complete ARM
>> Linux boot, without any filesystem, requires 30% longer if compared to
>> the mttcg implementation, benefiting however of being capable to offer
>> the infrastructure to handle atomic instructions on any architecture.
>> Instead compared to the current TCG upstream, it is 40% faster with four
>> vCPUs and 2.1 times faster with 8 vCPUs.
>> In addition, there is still margin to improve such performance, since at
>> the moment TLB is flushed quite often, probably more than the required.
>>
>> On the other hand, the test case
>> https://git.virtualopensystems.com/dev/tcg_baremetal_tests.git
>> that stresses heavily the LL/SC mechanic but not that much the TLB related
>> part, performs up to 1.9 times faster with 8 cores and one million
>> iterations if compared with the mttcg implementation.
>>
>> Changes from v2:
>> - the bitmap accessors are now atomic
>> - a rendezvous between vCPUs and a simple callback support before
>>   executing a TB have been added to handle the TLB flush support
>
> Isn't this exactly what my async_safe_work is supposed to do?

Hi Frederic,

I've started this implementation with your v4 and I've missed this
feature while porting to v6.
I think it's doable, it will make things simpler and cleaner.
Thank you,
alvise

>
>
>> - the softmmu_template and softmmu_llsc_template have been adapted to work
>>   on real multi-threading
>>
>> Changes from v1:
>> - The ram bitmap is not reversed anymore, 1 = dirty, 0 = exclusive
>> - The way how the offset to access the bitmap is calculated has
>>   been improved and fixed
>> - A page to be set as dirty requires a vCPU to target the protected
>>   address and not just an address in the page
>> - Addressed comments from Richard Henderson to improve the logic in
>>   softmmu_template.h and to simplify the methods generation through
>>   softmmu_llsc_template.h
>> - Added initial implementation of qemu_{ldlink,stcond}_i32 for tcg/i386
>>
>> This work has been sponsored by Huawei Technologies Duesseldorf GmbH.
>>
>> Alvise Rigo (13):
>>   exec: Add new exclusive bitmap to ram_list
>>   cputlb: Add new TLB_EXCL flag
>>   softmmu: Add helpers for a new slow-path
>>   tcg-op: create new TCG qemu_ldlink and qemu_stcond instructions
>>   target-arm: translate: implement qemu_ldlink and qemu_stcond ops
>>   target-i386: translate: implement qemu_ldlink and qemu_stcond ops
>>   ram_addr.h: Make exclusive bitmap accessors atomic
>>   exec.c: introduce a simple rendezvous support
>>   cpus.c: introduce simple callback support
>>   Simple TLB flush wrap to use as exit callback
>>   Introduce exit_flush_req and tcg_excl_access_lock
>>   softmmu_llsc_template.h: move to multithreading
>>   softmmu_template.h: move to multithreading
>>
>>  cpus.c                  |  39 ++++++++
>>  cputlb.c                |  33 +++++-
>>  exec.c                  |  46 +++++++++
>>  include/exec/cpu-all.h  |   2 +
>>  include/exec/cpu-defs.h |   8 ++
>>  include/exec/memory.h   |   3 +-
>>  include/exec/ram_addr.h |  22 ++++
>>  include/qom/cpu.h       |  37 +++++++
>>  softmmu_llsc_template.h | 184 ++++++++++++++++++++++++++++++++++
>>  softmmu_template.h      | 261 +++++++++++++++++++++++++++++++++++-------------
>>  target-arm/translate.c  |  87 +++++++++++++++-
>>  tcg/arm/tcg-target.c    | 121 ++++++++++++++++------
>>  tcg/i386/tcg-target.c   | 136 +++++++++++++++++++++----
>>  tcg/tcg-be-ldst.h       |   1 +
>>  tcg/tcg-op.c            |  23 +++++
>>  tcg/tcg-op.h            |   3 +
>>  tcg/tcg-opc.h           |   4 +
>>  tcg/tcg.c               |   2 +
>>  tcg/tcg.h               |  20 ++++
>>  19 files changed, 910 insertions(+), 122 deletions(-)
>>  create mode 100644 softmmu_llsc_template.h
>>
>
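
For anyone following the thread, the LL/SC slow-path described in the quoted
cover letter can be modelled roughly as below. This is a toy, single-threaded
Python sketch and not QEMU code: the class ToyMachine, its method names, the
4 KiB page size and the rule for when a page goes back to dirty are all
assumptions made for illustration. Only the bitmap polarity (1 = dirty,
0 = exclusive), the per-vCPU exclusive address and the "a conflicting store
fails the later StoreConditional" behaviour are taken from the cover letter.
TLB flushes, locking and the rendezvous are left out entirely.

```python
PAGE_BITS = 12  # assumption: 4 KiB pages


class ToyMachine:
    """Single-threaded toy model of the exclusive bitmap (hypothetical)."""

    def __init__(self, n_vcpus, n_pages):
        self.mem = {}                      # address -> 32-bit value
        self.dirty = [True] * n_pages      # True = dirty, False = exclusive
        self.excl_addr = [None] * n_vcpus  # protected address per vCPU

    def _page_has_link(self, page):
        # True if any vCPU still protects an address inside this page.
        return any(a is not None and (a >> PAGE_BITS) == page
                   for a in self.excl_addr)

    def load_link(self, cpu, addr):
        # Clear the dirty flag so later stores to the page take the slow
        # path, and publish the protected address for the other vCPUs.
        self.dirty[addr >> PAGE_BITS] = False
        self.excl_addr[cpu] = addr
        return self.mem.get(addr, 0)

    def store(self, cpu, addr, value):
        page = addr >> PAGE_BITS
        if not self.dirty[page]:
            # Slow path: break every other vCPU's link to this exact address.
            for other, protected in enumerate(self.excl_addr):
                if other != cpu and protected == addr:
                    self.excl_addr[other] = None
            # Assumption: the page returns to dirty (fast path) only once
            # no vCPU protects an address in it any more.
            if not self._page_has_link(page):
                self.dirty[page] = True
        self.mem[addr] = value & 0xFFFFFFFF

    def store_cond(self, cpu, addr, value):
        # strex-style result: 0 on success, 1 on failure.
        if self.excl_addr[cpu] != addr:
            return 1
        self.excl_addr[cpu] = None
        self.store(cpu, addr, value)
        return 0
```

In this model a store by another vCPU to the protected address makes the
pending store_cond() return 1, while a store to a different address in the
same page only takes the slow path and leaves the link intact.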