From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:50824) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1ZDTqN-00074w-Mv for qemu-devel@nongnu.org; Fri, 10 Jul 2015 04:39:41 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1ZDTqJ-00045B-U5 for qemu-devel@nongnu.org; Fri, 10 Jul 2015 04:39:39 -0400
Received: from greensocs.com ([193.104.36.180]:54734) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1ZDTqJ-00040k-AP for qemu-devel@nongnu.org; Fri, 10 Jul 2015 04:39:35 -0400
Message-ID: <559F84C2.1090109@greensocs.com>
Date: Fri, 10 Jul 2015 10:39:30 +0200
From: Frederic Konrad
MIME-Version: 1.0
References: <1436516626-8322-1-git-send-email-a.rigo@virtualopensystems.com>
In-Reply-To: <1436516626-8322-1-git-send-email-a.rigo@virtualopensystems.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [RFC v3 00/13] Slow-path for atomic instruction translation
To: Alvise Rigo , qemu-devel@nongnu.org, mttcg@listserver.greensocs.com
Cc: alex.bennee@linaro.org, jani.kokkonen@huawei.com, tech@virtualopensystems.com, claudio.fontana@huawei.com, pbonzini@redhat.com

On 10/07/2015 10:23, Alvise Rigo wrote:
> This is the third iteration of the patch series; starting from PATCH 007
> there are the changes to move the whole work to multi-threading.
> Changes versus previous versions are at the bottom of this cover letter.
>
> This patch series provides an infrastructure for atomic
> instruction implementation in QEMU, paving the way for TCG multi-threading.
> The adopted design does not rely on host atomic
> instructions and is intended to propose a 'legacy' solution for
> translating guest atomic instructions.
>
> The underlying idea is to provide new TCG instructions that guarantee
> atomicity to some memory accesses or in general a way to define memory
> transactions. More specifically, a new pair of TCG instructions are
> implemented, qemu_ldlink_i32 and qemu_stcond_i32, that behave as
> LoadLink and StoreConditional primitives (only 32 bit variant
> implemented). In order to achieve this, a new bitmap is added to the
> ram_list structure (always unique) which flags all memory pages that
> could not be accessed directly through the fast-path, due to previous
> exclusive operations. This new bitmap is coupled with a new TLB flag
> which forces the slow-path execution. All stores which are performed
> between an LL/SC operation by other vCPUs to the same (protected) address
> will fail the subsequent StoreConditional.
>
> In theory, the provided implementation of TCG LoadLink/StoreConditional
> can be used to properly handle atomic instructions on any architecture.
>
> The new slow-path is implemented such that:
> - the LoadLink behaves as a normal load slow-path, except for cleaning
>   the dirty flag in the bitmap. The TLB entries created from now on will
>   force the slow-path. To ensure it, we flush the TLB cache for the
>   other vCPUs. The vCPU also sets into a private variable the accessed
>   address, in order to make it visible to the other vCPUs
> - the StoreConditional behaves as a normal store slow-path, except for
>   checking whether other vCPUs have set the same exclusive address
>
> All those write accesses that are forced to follow the 'legacy'
> slow-path will set the accessed memory page to dirty.
>
> In this series only the ARM ldrex/strex instructions are implemented
> for ARM and i386 hosts.
> The code has been tested with bare-metal test cases and by booting Linux,
> using the latest mttcg QEMU branch available at
> http://git.greensocs.com/fkonrad/mttcg.git. branch multi_tcg_v6 at this time.
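
To make the LL/SC scheme quoted above concrete, here is a minimal
single-threaded sketch of it. It is a toy model and not the code from the
series: every name (guest_ram, page_dirty, excl_addr, load_link,
plain_store, store_cond) is hypothetical, the page size and vCPU count are
arbitrary, there is no TLB or locking, and re-dirtying a page only once no
vCPU holds a link in it is a simplifying assumption of the model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Toy model only: hypothetical names and sizes, not QEMU code. */
#define NUM_VCPUS 4
#define PAGE_BITS 12
#define NUM_PAGES 16
#define NO_EXCL   UINT64_MAX

static uint32_t guest_ram[NUM_PAGES << (PAGE_BITS - 2)]; /* 32-bit words */
static bool     page_dirty[NUM_PAGES];  /* 1 = dirty, 0 = exclusive */
static uint64_t excl_addr[NUM_VCPUS];   /* per-vCPU protected address */

void model_init(void)
{
    memset(guest_ram, 0, sizeof(guest_ram));
    for (int i = 0; i < NUM_PAGES; i++) {
        page_dirty[i] = true;           /* everything starts on the fast path */
    }
    for (int i = 0; i < NUM_VCPUS; i++) {
        excl_addr[i] = NO_EXCL;
    }
}

/* True if any vCPU still protects an address inside this page. */
static bool page_has_links(uint64_t page)
{
    for (int i = 0; i < NUM_VCPUS; i++) {
        if (excl_addr[i] != NO_EXCL && (excl_addr[i] >> PAGE_BITS) == page) {
            return true;
        }
    }
    return false;
}

/* Drop the links that other vCPUs hold on exactly this address. */
static void break_other_links(int cpu, uint64_t addr)
{
    for (int i = 0; i < NUM_VCPUS; i++) {
        if (i != cpu && excl_addr[i] == addr) {
            excl_addr[i] = NO_EXCL;
        }
    }
}

/* LoadLink: a normal load that also marks the page exclusive and
 * records the protected address for this vCPU. */
uint32_t load_link(int cpu, uint64_t addr)
{
    assert(cpu >= 0 && cpu < NUM_VCPUS);
    assert((addr & 3) == 0 && (addr >> PAGE_BITS) < NUM_PAGES);
    page_dirty[addr >> PAGE_BITS] = false;  /* later stores take the slow path */
    excl_addr[cpu] = addr;
    return guest_ram[addr >> 2];
}

/* Ordinary store: only pages flagged exclusive pay for the link check. */
void plain_store(int cpu, uint64_t addr, uint32_t val)
{
    uint64_t page = addr >> PAGE_BITS;
    if (!page_dirty[page]) {                /* slow path */
        break_other_links(cpu, addr);
        if (!page_has_links(page)) {
            page_dirty[page] = true;        /* back to the fast path */
        }
    }
    guest_ram[addr >> 2] = val;
}

/* StoreConditional: returns 0 on success, 1 on failure (strex-style). */
int store_cond(int cpu, uint64_t addr, uint32_t val)
{
    uint64_t page = addr >> PAGE_BITS;
    if (excl_addr[cpu] != addr) {
        return 1;                           /* link was lost or never set */
    }
    excl_addr[cpu] = NO_EXCL;
    break_other_links(cpu, addr);
    guest_ram[addr >> 2] = val;
    if (!page_has_links(page)) {
        page_dirty[page] = true;
    }
    return 0;
}
```

In this model, a load_link() on vCPU 0 followed by a plain_store() from
vCPU 1 to the same address makes vCPU 0's store_cond() return 1, which is
the failure case the cover letter describes.
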
>
> * Performance considerations
> This implementation shows good results while booting a Linux kernel,
> where tons of flushes affect the overall performance. A complete ARM
> Linux boot, without any filesystem, requires 30% longer if compared to
> the mttcg implementation, benefiting however of being capable to offer
> the infrastructure to handle atomic instructions on any architecture.
> Instead compared to the current TCG upstream, it is 40% faster with four
> vCPUs and 2.1 times faster with 8 vCPUs.
> In addition, there is still margin to improve such performance, since at
> the moment TLB is flushed quite often, probably more than the required.
>
> On the other hand, the test case
> https://git.virtualopensystems.com/dev/tcg_baremetal_tests.git
> that stresses heavily the LL/SC mechanic but not that much the TLB related
> part, performs up to 1.9 times faster with 8 cores and one million iterations
> if compared with the mttcg implementation.
>
> Changes from v2:
> - the bitmap accessors are now atomic
> - a rendezvous between vCPUs and a simple callback support before executing
>   a TB have been added to handle the TLB flush support

Isn't that exactly what my async_safe_work is supposed to do?

> - the softmmu_template and softmmu_llsc_template have been adapted to work
>   on real multi-threading
>
> Changes from v1:
> - The ram bitmap is not reversed anymore, 1 = dirty, 0 = exclusive
> - The way how the offset to access the bitmap is calculated has
>   been improved and fixed
> - A page to be set as dirty requires a vCPU to target the protected address
>   and not just an address in the page
> - Addressed comments from Richard Henderson to improve the logic in
>   softmmu_template.h and to simplify the methods generation through
>   softmmu_llsc_template.h
> - Added initial implementation of qemu_{ldlink,stcond}_i32 for tcg/i386
>
> This work has been sponsored by Huawei Technologies Duesseldorf GmbH.
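
The changelog above says the bitmap accessors are now atomic and that
1 = dirty, 0 = exclusive. As a rough illustration of what atomic accessors
for such a bitmap can look like, here is a sketch using C11 <stdatomic.h>.
It is not the code from ram_addr.h: the names (excl_bitmap,
excl_bitmap_init, excl_bitmap_set_dirty, excl_bitmap_test_and_clear_dirty,
excl_bitmap_is_dirty) and the bitmap size are hypothetical, and the series
itself may well use QEMU's own atomic helpers instead.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sketch: one bit per guest page, 1 = dirty, 0 = exclusive. */
#define EXCL_BITMAP_PAGES 1024
#define BITS_PER_WORD     (sizeof(unsigned long) * CHAR_BIT)
#define EXCL_BITMAP_WORDS \
    ((EXCL_BITMAP_PAGES + BITS_PER_WORD - 1) / BITS_PER_WORD)

static atomic_ulong excl_bitmap[EXCL_BITMAP_WORDS];

static unsigned long page_mask(size_t page)
{
    return 1UL << (page % BITS_PER_WORD);
}

/* Mark every page dirty, i.e. nothing is under exclusive access yet. */
void excl_bitmap_init(void)
{
    for (size_t i = 0; i < EXCL_BITMAP_WORDS; i++) {
        atomic_store(&excl_bitmap[i], ~0UL);
    }
}

/* Set the dirty bit with a single atomic read-modify-write. */
void excl_bitmap_set_dirty(size_t page)
{
    assert(page < EXCL_BITMAP_PAGES);
    atomic_fetch_or(&excl_bitmap[page / BITS_PER_WORD], page_mask(page));
}

/* Clear the dirty bit (page becomes exclusive); report whether it was
 * dirty before, so exactly one caller observes the transition. */
bool excl_bitmap_test_and_clear_dirty(size_t page)
{
    assert(page < EXCL_BITMAP_PAGES);
    unsigned long old = atomic_fetch_and(&excl_bitmap[page / BITS_PER_WORD],
                                         ~page_mask(page));
    return (old & page_mask(page)) != 0;
}

bool excl_bitmap_is_dirty(size_t page)
{
    assert(page < EXCL_BITMAP_PAGES);
    return (atomic_load(&excl_bitmap[page / BITS_PER_WORD])
            & page_mask(page)) != 0;
}
```

The point of the fetch-and/fetch-or form is that two vCPU threads touching
different bits of the same word cannot lose each other's update, which a
plain load, modify, store sequence would allow.
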
>
> Alvise Rigo (13):
>   exec: Add new exclusive bitmap to ram_list
>   cputlb: Add new TLB_EXCL flag
>   softmmu: Add helpers for a new slow-path
>   tcg-op: create new TCG qemu_ldlink and qemu_stcond instructions
>   target-arm: translate: implement qemu_ldlink and qemu_stcond ops
>   target-i386: translate: implement qemu_ldlink and qemu_stcond ops
>   ram_addr.h: Make exclusive bitmap accessors atomic
>   exec.c: introduce a simple rendezvous support
>   cpus.c: introduce simple callback support
>   Simple TLB flush wrap to use as exit callback
>   Introduce exit_flush_req and tcg_excl_access_lock
>   softmmu_llsc_template.h: move to multithreading
>   softmmu_template.h: move to multithreading
>
>  cpus.c                  |  39 ++++++++
>  cputlb.c                |  33 +++++-
>  exec.c                  |  46 +++++++++
>  include/exec/cpu-all.h  |   2 +
>  include/exec/cpu-defs.h |   8 ++
>  include/exec/memory.h   |   3 +-
>  include/exec/ram_addr.h |  22 ++++
>  include/qom/cpu.h       |  37 +++++++
>  softmmu_llsc_template.h | 184 ++++++++++++++++++++++++++++++++++
>  softmmu_template.h      | 261 +++++++++++++++++++++++++++++++++++-------------
>  target-arm/translate.c  |  87 +++++++++++++++-
>  tcg/arm/tcg-target.c    | 121 ++++++++++++++++------
>  tcg/i386/tcg-target.c   | 136 +++++++++++++++++++++----
>  tcg/tcg-be-ldst.h       |   1 +
>  tcg/tcg-op.c            |  23 +++++
>  tcg/tcg-op.h            |   3 +
>  tcg/tcg-opc.h           |   4 +
>  tcg/tcg.c               |   2 +
>  tcg/tcg.h               |  20 ++++
>  19 files changed, 910 insertions(+), 122 deletions(-)
>  create mode 100644 softmmu_llsc_template.h
>