Re: [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation

From: alvise rigo <a.rigo@virtualopensystems.com>
To: "Alex Bennée" <alex.bennee@linaro.org>
Cc: mttcg@listserver.greensocs.com,
	Claudio Fontana <claudio.fontana@huawei.com>,
	QEMU Developers <qemu-devel@nongnu.org>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Jani Kokkonen <jani.kokkonen@huawei.com>,
	VirtualOpenSystems Technical Team <tech@virtualopensystems.com>,
	Richard Henderson <rth@twiddle.net>
Subject: Re: [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation
Date: Thu, 17 Dec 2015 17:16:33 +0100
Message-ID: <CAH47eN0ShA_RzO251X04gh231+FGmyHemk+AgqJyS6nuukxy2A@mail.gmail.com>
In-Reply-To: <87si31f4a8.fsf@linaro.org>


Hi Alex,

On Thu, Dec 17, 2015 at 5:06 PM, Alex Bennée <alex.bennee@linaro.org> wrote:

>
> Alvise Rigo <a.rigo@virtualopensystems.com> writes:
>
> > This is the sixth iteration of the patch series which applies to the
> > upstream branch of QEMU (v2.5.0-rc3).
> >
> > Changes versus previous versions are at the bottom of this cover letter.
> >
> > The code is also available at the following repository:
> > https://git.virtualopensystems.com/dev/qemu-mt.git
> > branch:
> > slowpath-for-atomic-v6-no-mttcg
>
> I'm starting to look through this now. However one problem that
>

Thank you for this.

> immediately comes up is the aarch64 breakage. Because there is an
> intrinsic link between a lot of the arm and aarch64 code it breaks the
> other targets.
>
> You could fix this by ensuring that CONFIG_TCG_USE_LDST_EXCL doesn't get
> passed to the aarch64 build (tricky as aarch64-softmmu.mak includes
> arm-softmmu.mak) or bite the bullet now and add the 64 bit helpers that
> will be needed to convert the aarch64 exclusive equivalents.
>

This is what I'm doing right now :)

Best regards,
alvise

>
> >
> > This patch series provides an infrastructure for atomic instruction
> > implementation in QEMU, thus offering a 'legacy' solution for
> > translating guest atomic instructions. Moreover, it can be considered as
> > a first step toward a multi-thread TCG.
> >
> > The underlying idea is to provide new TCG helpers (sort of softmmu
> > helpers) that guarantee atomicity to some memory accesses or in general
> > a way to define memory transactions.
> >
> > More specifically, the new softmmu helpers behave as LoadLink and
> > StoreConditional instructions, and are called from TCG code by means of
> > target specific helpers. This work includes the implementation for all
> > the ARM atomic instructions, see target-arm/op_helper.c.
> >
> > The implementation heavily uses the software TLB together with a new
> > bitmap that has been added to the ram_list structure which flags, on a
> > per-CPU basis, all the memory pages that are in the middle of a LoadLink
> > (LL)/StoreConditional (SC) operation.  Since all these pages can be
> > accessed directly through the fast-path and alter a vCPU's linked value,
> > the new bitmap has been coupled with a new TLB flag for the TLB virtual
> > address which forces the slow-path execution for all the accesses to a
> > page containing a linked address.
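
A minimal sketch of the bitmap/TLB-flag coupling described above, written
in Python for brevity. It is a toy model, not the series' code: the names
ExclusiveBitmap, tlb_entry_flags and access_path, and the value chosen for
TLB_EXCL, are assumptions made for illustration. The "1 = dirty,
0 = exclusive" convention follows the v1 changelog entry further down.
The TLB flush of the other vCPUs is not modelled.

```python
# Toy model: a per-vCPU exclusive bitmap decides whether a TLB entry
# gets the EXCL flag, which in turn forces the slow path.
# All names and values here are illustrative, not QEMU's.

TLB_EXCL = 1 << 0  # hypothetical flag bit


class ExclusiveBitmap:
    """One bit per (page, vCPU): 1 = dirty, 0 = exclusive."""

    def __init__(self, num_pages, num_vcpus):
        self.bits = [[1] * num_vcpus for _ in range(num_pages)]

    def mark_exclusive(self, page, vcpu):
        # Done by the LL slow path: clear the dirty bit for this vCPU.
        self.bits[page][vcpu] = 0

    def mark_dirty(self, page):
        # Restore the page as non-exclusive for every vCPU.
        self.bits[page] = [1] * len(self.bits[page])

    def is_exclusive(self, page):
        # True if at least one vCPU has its bit cleared for this page.
        return any(bit == 0 for bit in self.bits[page])


def tlb_entry_flags(bitmap, page):
    """Flags computed when a TLB entry for `page` is generated."""
    return TLB_EXCL if bitmap.is_exclusive(page) else 0


def access_path(flags):
    """Accesses through an entry with TLB_EXCL must take the slow path."""
    return "slow" if flags & TLB_EXCL else "fast"
```

In this model, a page stays on the fast path until some vCPU marks it
exclusive; from then on every newly generated TLB entry for that page
carries the flag until the page is set dirty again.
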
> >
> > The new slow-path is implemented such that:
> > - the LL behaves as a normal load slow-path, except for clearing the
> >   dirty flag in the bitmap.  While generating a TLB entry, the cputlb.c
> >   code checks whether at least one vCPU has the bit cleared in the
> >   exclusive bitmap; in that case the TLB entry will have the EXCL flag
> >   set, thus forcing the slow-path.  In order to ensure that all the
> >   vCPUs will follow the slow-path for that page, we flush the TLB cache
> >   of all the other vCPUs.
> >
> >   The LL will also set the linked address and size of the access in a
> >   vCPU's private variable. After the corresponding SC, this address will
> >   be set to a reset value.
> >
> > - the SC can fail, returning 1, or succeed, returning 0.  It must
> >   always come after an LL and must access the same address 'linked' by
> >   the previous LL, otherwise it fails. If in the time window delimited
> >   by a legit pair of LL/SC operations another write access happens to
> >   the linked address, the SC will fail.
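
The LL/SC rules in the list above can be sketched as a small Python model.
This is only an illustration of the semantics as the cover letter states
them (SC returns 0 on success and 1 on failure, and a write to the linked
address by another vCPU breaks the link); the Machine and VCPU classes,
their method names and the use of None as the reset value are assumptions,
not the helpers added by the series.

```python
# Toy model of the LL/SC semantics: per-vCPU linked address, reset after
# the SC, and broken by another vCPU's write to the linked range.
# Class and method names are illustrative only.

RESET_ADDR = None  # hypothetical "no link" value


class VCPU:
    def __init__(self):
        self.linked_addr = RESET_ADDR
        self.linked_size = 0


class Machine:
    def __init__(self, num_vcpus):
        self.memory = {}
        self.vcpus = [VCPU() for _ in range(num_vcpus)]

    def load_link(self, cpu, addr, size=4):
        # LL: record the linked address and access size, then load.
        vcpu = self.vcpus[cpu]
        vcpu.linked_addr = addr
        vcpu.linked_size = size
        return self.memory.get(addr, 0)

    def store(self, cpu, addr, value):
        # A write that hits another vCPU's linked range breaks that link.
        for index, other in enumerate(self.vcpus):
            if index == cpu or other.linked_addr is RESET_ADDR:
                continue
            if other.linked_addr <= addr < other.linked_addr + other.linked_size:
                other.linked_addr = RESET_ADDR
        self.memory[addr] = value

    def store_conditional(self, cpu, addr, value):
        # SC: succeeds (0) only if the link is intact and matches `addr`.
        vcpu = self.vcpus[cpu]
        linked = vcpu.linked_addr is not RESET_ADDR and vcpu.linked_addr == addr
        vcpu.linked_addr = RESET_ADDR  # the link is reset after any SC
        if not linked:
            return 1
        self.store(cpu, addr, value)
        return 0
```

A guest atomic increment would loop on load_link/store_conditional until
the latter returns 0, which is the usual retry pattern for LL/SC pairs.
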
> >
> > In theory, the provided implementation of TCG LoadLink/StoreConditional
> > can be used to properly handle atomic instructions on any architecture.
> >
> > The code has been tested with bare-metal test cases and by booting Linux.
> >
> > * Performance considerations
> > The new slow-path adds some overhead to the translation of the ARM
> > atomic instructions, since their emulation no longer happens only in
> > the guest (by means of pure TCG generated code), but requires the
> > execution of two helper functions. Despite this, the additional time
> > required to boot an ARM Linux kernel on an i7 clocked at 2.5GHz is
> > negligible.
> > Instead, on a LL/SC bound test scenario - like:
> > https://git.virtualopensystems.com/dev/tcg_baremetal_tests.git - this
> > solution requires 30% (1 million iterations) and 70% (10 million
> > iterations) of additional time for the test to complete.
> >
> > Changes from v5:
> > - The exclusive memory region is now set through a CPUClass hook,
> >   allowing any architecture to decide the memory area that will be
> >   protected during a LL/SC operation [PATCH 3]
> > - The runtime helpers dropped any target dependency and are now in a
> >   common file [PATCH 5]
> > - Improved the way we restore a guest page as non-exclusive [PATCH 9]
> > - Included MMIO memory as a possible target of LL/SC
> >   instructions. This also required simplifying the
> >   helper_*_st_name helpers in softmmu_template.h [PATCH 8-14]
> >
> > Changes from v4:
> > - Reworked the exclusive bitmap to be of fixed size (8 bits per address)
> > - The slow-path is now TCG backend independent, no need to touch
> >   tcg/* anymore as suggested by Aurelien Jarno.
> >
> > Changes from v3:
> > - based on upstream QEMU
> > - addressed comments from Alex Bennée
> > - the slow path can be enabled by the user with:
> >   ./configure --enable-tcg-ldst-excl only if the backend supports it
> > - all the ARM ldex/stex instructions now make use of the slow path
> > - added aarch64 TCG backend support
> > - part of the code has been rewritten
> >
> > Changes from v2:
> > - the bitmap accessors are now atomic
> > - a rendezvous between vCPUs and a simple callback support before
> >   executing a TB have been added to handle the TLB flush support
> > - the softmmu_template and softmmu_llsc_template have been adapted to
> >   work on real multi-threading
> >
> > Changes from v1:
> > - The ram bitmap is not reversed anymore, 1 = dirty, 0 = exclusive
> > - The way the offset to access the bitmap is calculated has
> >   been improved and fixed
> > - A page to be set as dirty requires a vCPU to target the protected
> >   address and not just an address in the page
> > - Addressed comments from Richard Henderson to improve the logic in
> >   softmmu_template.h and to simplify the methods generation through
> >   softmmu_llsc_template.h
> > - Added initial implementation of qemu_{ldlink,stcond}_i32 for tcg/i386
> >
> > This work has been sponsored by Huawei Technologies Duesseldorf GmbH.
> >
> > Alvise Rigo (14):
> >   exec.c: Add new exclusive bitmap to ram_list
> >   softmmu: Add new TLB_EXCL flag
> >   Add CPUClass hook to set exclusive range
> >   softmmu: Add helpers for a new slowpath
> >   tcg: Create new runtime helpers for excl accesses
> >   configure: Use slow-path for atomic only when the softmmu is enabled
> >   target-arm: translate: Use ld/st excl for atomic insns
> >   target-arm: Add atomic_clear helper for CLREX insn
> >   softmmu: Add history of excl accesses
> >   softmmu: Simplify helper_*_st_name, wrap unaligned code
> >   softmmu: Simplify helper_*_st_name, wrap MMIO code
> >   softmmu: Simplify helper_*_st_name, wrap RAM code
> >   softmmu: Include MMIO/invalid exclusive accesses
> >   softmmu: Protect MMIO exclusive range
> >
> >  Makefile.target             |   2 +-
> >  configure                   |   4 +
> >  cputlb.c                    |  67 ++++++++-
> >  exec.c                      |   8 +-
> >  include/exec/cpu-all.h      |   8 ++
> >  include/exec/cpu-defs.h     |   1 +
> >  include/exec/helper-gen.h   |   1 +
> >  include/exec/helper-proto.h |   1 +
> >  include/exec/helper-tcg.h   |   1 +
> >  include/exec/memory.h       |   4 +-
> >  include/exec/ram_addr.h     |  76 ++++++++++
> >  include/qom/cpu.h           |  21 +++
> >  qom/cpu.c                   |   7 +
> >  softmmu_llsc_template.h     | 144 +++++++++++++++++++
> >  softmmu_template.h          | 338 +++++++++++++++++++++++++++++++++-----------
> >  target-arm/helper.h         |   2 +
> >  target-arm/op_helper.c      |   6 +
> >  target-arm/translate.c      | 102 ++++++++++++-
> >  tcg-llsc-helper.c           | 109 ++++++++++++++
> >  tcg-llsc-helper.h           |  35 +++++
> >  tcg/tcg-llsc-gen-helper.h   |  32 +++++
> >  tcg/tcg.h                   |  31 ++++
> >  22 files changed, 909 insertions(+), 91 deletions(-)
> >  create mode 100644 softmmu_llsc_template.h
> >  create mode 100644 tcg-llsc-helper.c
> >  create mode 100644 tcg-llsc-helper.h
> >  create mode 100644 tcg/tcg-llsc-gen-helper.h
>
>
> --
> Alex Bennée
>
