Re: [Qemu-devel] [RFC v3 00/13] Slow-path for atomic instruction translation

From: alvise rigo <a.rigo@virtualopensystems.com>
To: Frederic Konrad <fred.konrad@greensocs.com>
Cc: mttcg@listserver.greensocs.com,
	"Claudio Fontana" <claudio.fontana@huawei.com>,
	"QEMU Developers" <qemu-devel@nongnu.org>,
	"Paolo Bonzini" <pbonzini@redhat.com>,
	"Jani Kokkonen" <jani.kokkonen@huawei.com>,
	"VirtualOpenSystems Technical Team" <tech@virtualopensystems.com>,
	"Alex Bennée" <alex.bennee@linaro.org>
Subject: Re: [Qemu-devel] [RFC v3 00/13] Slow-path for atomic instruction translation
Date: Fri, 10 Jul 2015 11:04:46 +0200
Message-ID: <CAH47eN0-RpXTLymxU=9kA2RoGF2kZC6h5BFJ42bSudO5c8Q19g@mail.gmail.com>
In-Reply-To: <559F84C2.1090109@greensocs.com>

On Fri, Jul 10, 2015 at 10:39 AM, Frederic Konrad
<fred.konrad@greensocs.com> wrote:
> On 10/07/2015 10:23, Alvise Rigo wrote:
>>
>> This is the third iteration of the patch series; starting from PATCH 007
>> there are the changes to move the whole work to multi-threading.
>> Changes versus previous versions are at the bottom of this cover letter.
>>
>> This patch series provides an infrastructure for atomic
>> instruction implementation in QEMU, paving the way for TCG
>> multi-threading.
>> The adopted design does not rely on host atomic
>> instructions and is intended to propose a 'legacy' solution for
>> translating guest atomic instructions.
>>
>> The underlying idea is to provide new TCG instructions that guarantee
>> atomicity to some memory accesses or in general a way to define memory
>> transactions. More specifically, a new pair of TCG instructions are
>> implemented, qemu_ldlink_i32 and qemu_stcond_i32, that behave as
>> LoadLink and StoreConditional primitives (only 32 bit variant
>> implemented).  In order to achieve this, a new bitmap is added to the
>> ram_list structure (always unique) which flags all memory pages that
>> could not be accessed directly through the fast-path, due to previous
>> exclusive operations. This new bitmap is coupled with a new TLB flag
>> which forces the slow-path execution. Any store performed by another
>> vCPU to the same (protected) address between a LoadLink and its
>> StoreConditional will make that StoreConditional fail.
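
A toy sketch in C of how a TLB flag can force the slow path, for readers
following the paragraph above. It is not code from the series: TLB_EXCL is
the flag name used by the patches, but the bit position, the struct and
every toy_* helper here are hypothetical. It assumes the flag lives in the
low bits of the page-aligned address stored in the TLB entry, so that the
plain fast-path compare no longer matches once the flag is set.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model, not QEMU code: all names and values are illustrative. */
#define TOY_PAGE_BITS 12
#define TOY_PAGE_MASK (~(((uint64_t)1 << TOY_PAGE_BITS) - 1))
#define TOY_TLB_EXCL  ((uint64_t)1 << 0)  /* hypothetical bit, below the page bits */

typedef struct {
    /* Page-aligned guest address, with flag bits OR-ed into the low bits. */
    uint64_t addr_write;
} ToyTLBEntry;

/* Build an entry for the page containing addr. */
static ToyTLBEntry toy_tlb_fill(uint64_t addr, bool page_is_exclusive)
{
    ToyTLBEntry e = { .addr_write = addr & TOY_PAGE_MASK };

    /* The flag must not overlap the page number. */
    assert((TOY_TLB_EXCL & TOY_PAGE_MASK) == 0);
    if (page_is_exclusive) {
        e.addr_write |= TOY_TLB_EXCL;
    }
    return e;
}

/* Fast path: a single compare. Any flag bit set makes it miss. */
static bool toy_tlb_fast_path_hit(const ToyTLBEntry *e, uint64_t addr)
{
    return e->addr_write == (addr & TOY_PAGE_MASK);
}

/* Slow path: mask the flags away before comparing the page. */
static bool toy_tlb_page_matches(const ToyTLBEntry *e, uint64_t addr)
{
    return (e->addr_write & TOY_PAGE_MASK) == (addr & TOY_PAGE_MASK);
}
```

With an entry filled as exclusive, toy_tlb_fast_path_hit() misses for every
address in the page while toy_tlb_page_matches() still succeeds, which is
the point where a slow-path helper would take over.
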
>>
>> In theory, the provided implementation of TCG LoadLink/StoreConditional
>> can be used to properly handle atomic instructions on any architecture.
>>
>> The new slow-path is implemented such that:
>> - the LoadLink behaves as a normal load slow-path, except for clearing
>>    the dirty flag in the bitmap. The TLB entries created from now on will
>>    force the slow-path. To ensure it, we flush the TLB cache for the
>>    other vCPUs. The vCPU also records the accessed address in a private
>>    variable, in order to make it visible to the other vCPUs
>> - the StoreConditional behaves as a normal store slow-path, except for
>>    checking whether other vCPUs have set the same exclusive address
>>
>> All those write accesses that are forced to follow the 'legacy'
>> slow-path will set the accessed memory page to dirty.
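
The LoadLink/StoreConditional behaviour described above can be sketched as
a small standalone model in C. This is only an illustration under stated
assumptions, not the logic of softmmu_llsc_template.h: all toy_* names are
hypothetical, the TLB flush is reduced to a comment, and locking between
vCPU threads is deliberately left out (only the bitmap updates are atomic).
The bitmap convention is the one from the changelog: 1 = dirty,
0 = exclusive.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model, not QEMU code: all names and sizes are illustrative. */
#define TOY_PAGE_BITS 12
#define TOY_NUM_PAGES 4
#define TOY_NUM_VCPUS 2
#define TOY_NO_EXCL   UINT64_MAX

static uint32_t toy_ram[(TOY_NUM_PAGES << TOY_PAGE_BITS) / 4];
static atomic_uint toy_dirty = 0xFu;  /* one bit per page, all dirty at start */
static uint64_t toy_excl_addr[TOY_NUM_VCPUS] = { TOY_NO_EXCL, TOY_NO_EXCL };

static unsigned toy_page_bit(uint64_t addr)
{
    assert((addr & 3) == 0 && (addr >> TOY_PAGE_BITS) < TOY_NUM_PAGES);
    return 1u << (addr >> TOY_PAGE_BITS);
}

static bool toy_page_is_exclusive(uint64_t addr)
{
    return (atomic_load(&toy_dirty) & toy_page_bit(addr)) == 0;
}

/* Break the link of every other vCPU on addr; true if one was broken. */
static bool toy_break_other_links(int cpu, uint64_t addr)
{
    bool hit = false;
    for (int i = 0; i < TOY_NUM_VCPUS; i++) {
        if (i != cpu && toy_excl_addr[i] == addr) {
            toy_excl_addr[i] = TOY_NO_EXCL;
            hit = true;
        }
    }
    return hit;
}

/* Set the page dirty again once no vCPU holds a link inside it. */
static void toy_mark_dirty_if_unlinked(uint64_t addr)
{
    for (int i = 0; i < TOY_NUM_VCPUS; i++) {
        if (toy_excl_addr[i] != TOY_NO_EXCL &&
            (toy_excl_addr[i] >> TOY_PAGE_BITS) == (addr >> TOY_PAGE_BITS)) {
            return;
        }
    }
    atomic_fetch_or(&toy_dirty, toy_page_bit(addr));
}

static uint32_t toy_ldlink(int cpu, uint64_t addr)
{
    atomic_fetch_and(&toy_dirty, ~toy_page_bit(addr)); /* page is now exclusive */
    /* A real implementation would flush the other vCPUs' TLBs here. */
    toy_excl_addr[cpu] = addr;                         /* publish the link */
    return toy_ram[addr / 4];
}

/* Normal store: only exclusive pages pay for the link check. */
static void toy_store(int cpu, uint64_t addr, uint32_t val)
{
    if (toy_page_is_exclusive(addr) && toy_break_other_links(cpu, addr)) {
        toy_mark_dirty_if_unlinked(addr);
    }
    toy_ram[addr / 4] = val;
}

/* Returns true when the conditional store was performed. */
static bool toy_stcond(int cpu, uint64_t addr, uint32_t val)
{
    if (toy_excl_addr[cpu] != addr) {
        return false;                /* link lost, or never taken */
    }
    toy_excl_addr[cpu] = TOY_NO_EXCL;
    toy_break_other_links(cpu, addr);
    toy_ram[addr / 4] = val;
    toy_mark_dirty_if_unlinked(addr);
    return true;
}
```

In this model a store from another vCPU to the linked address breaks the
link and makes the later toy_stcond() fail, while a store to a different
address in the same page leaves both the link and the exclusive state of
the page untouched.
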
>>
>> In this series only the ARM ldrex/strex instructions are implemented
>> for ARM and i386 hosts.
>> The code has been tested with bare-metal test cases and by booting Linux,
>> using the latest mttcg QEMU branch available at
>> http://git.greensocs.com/fkonrad/mttcg.git.
>
> branch multi_tcg_v6 at this time.
>
>>
>> * Performance considerations
>> This implementation shows good results while booting a Linux kernel,
>> where the large number of flushes affects the overall performance. A
>> complete ARM Linux boot, without any filesystem, takes 30% longer than
>> with the mttcg implementation, but in exchange this approach offers
>> the infrastructure to handle atomic instructions on any architecture.
>> Compared to the current TCG upstream, it is 40% faster with four
>> vCPUs and 2.1 times faster with 8 vCPUs.
>> In addition, there is still margin to improve performance, since at
>> the moment the TLB is flushed quite often, probably more than required.
>>
>> On the other hand, the test case
>> https://git.virtualopensystems.com/dev/tcg_baremetal_tests.git
>> that heavily stresses the LL/SC mechanism, but not so much the TLB-related
>> part, performs up to 1.9 times faster with 8 cores and one million
>> iterations when compared with the mttcg implementation.
>>
>> Changes from v2:
>> - the bitmap accessors are now atomic
>> - a rendezvous between vCPUs and a simple callback support before
>>    executing a TB have been added to handle the TLB flush support
>
> Isn't that exactly what my async_safe_work is supposed to do?

Hi Frederic,

I started this implementation on top of your v4 and missed this
feature while porting to v6.
I think it's doable, and it will make things simpler and cleaner.

Thank you,
alvise

>
>
>> - the softmmu_template and softmmu_llsc_template have been adapted to work
>>    on real multi-threading
>>
>> Changes from v1:
>> - The ram bitmap is not reversed anymore, 1 = dirty, 0 = exclusive
>> - The way the offset to access the bitmap is calculated has
>>    been improved and fixed
>> - A page to be set as dirty requires a vCPU to target the protected
>>    address and not just an address in the page
>> - Addressed comments from Richard Henderson to improve the logic in
>>    softmmu_template.h and to simplify the methods generation through
>>    softmmu_llsc_template.h
>> - Added initial implementation of qemu_{ldlink,stcond}_i32 for tcg/i386
>>
>> This work has been sponsored by Huawei Technologies Duesseldorf GmbH.
>>
>> Alvise Rigo (13):
>>    exec: Add new exclusive bitmap to ram_list
>>    cputlb: Add new TLB_EXCL flag
>>    softmmu: Add helpers for a new slow-path
>>    tcg-op: create new TCG qemu_ldlink and qemu_stcond instructions
>>    target-arm: translate: implement qemu_ldlink and qemu_stcond ops
>>    target-i386: translate: implement qemu_ldlink and qemu_stcond ops
>>    ram_addr.h: Make exclusive bitmap accessors atomic
>>    exec.c: introduce a simple rendezvous support
>>    cpus.c: introduce simple callback support
>>    Simple TLB flush wrap to use as exit callback
>>    Introduce exit_flush_req and tcg_excl_access_lock
>>    softmmu_llsc_template.h: move to multithreading
>>    softmmu_template.h: move to multithreading
>>
>>   cpus.c                  |  39 ++++++++
>>   cputlb.c                |  33 +++++-
>>   exec.c                  |  46 +++++++++
>>   include/exec/cpu-all.h  |   2 +
>>   include/exec/cpu-defs.h |   8 ++
>>   include/exec/memory.h   |   3 +-
>>   include/exec/ram_addr.h |  22 ++++
>>   include/qom/cpu.h       |  37 +++++++
>>   softmmu_llsc_template.h | 184 ++++++++++++++++++++++++++++++++++
>>   softmmu_template.h      | 261 +++++++++++++++++++++++++++++++++++-------------
>>   target-arm/translate.c  |  87 +++++++++++++++-
>>   tcg/arm/tcg-target.c    | 121 ++++++++++++++++------
>>   tcg/i386/tcg-target.c   | 136 +++++++++++++++++++++----
>>   tcg/tcg-be-ldst.h       |   1 +
>>   tcg/tcg-op.c            |  23 +++++
>>   tcg/tcg-op.h            |   3 +
>>   tcg/tcg-opc.h           |   4 +
>>   tcg/tcg.c               |   2 +
>>   tcg/tcg.h               |  20 ++++
>>   19 files changed, 910 insertions(+), 122 deletions(-)
>>   create mode 100644 softmmu_llsc_template.h
>>
>