Hi Alex,

On Thu, Dec 17, 2015 at 5:06 PM, Alex Bennée wrote:
>
> Alvise Rigo writes:
>
> > This is the sixth iteration of the patch series, which applies to the
> > upstream branch of QEMU (v2.5.0-rc3).
> >
> > Changes versus previous versions are at the bottom of this cover letter.
> >
> > The code is also available at the following repository:
> > https://git.virtualopensystems.com/dev/qemu-mt.git
> > branch:
> > slowpath-for-atomic-v6-no-mttcg
>
> I'm starting to look through this now.

Thank you for this.

> However, one problem that immediately comes up is the aarch64 breakage.
> Because there is an intrinsic link between a lot of the arm and aarch64
> code, it breaks the other targets.
>
> You could fix this by ensuring that CONFIG_TCG_USE_LDST_EXCL doesn't get
> passed to the aarch64 build (tricky, as aarch64-softmmu.mak includes
> arm-softmmu.mak) or bite the bullet now and add the 64-bit helpers that
> will be needed to convert the aarch64 exclusive equivalents.

This is what I'm doing right now :)

Best regards,
alvise

> > This patch series provides an infrastructure for atomic instruction
> > implementation in QEMU, thus offering a 'legacy' solution for
> > translating guest atomic instructions. Moreover, it can be considered
> > a first step toward a multi-threaded TCG.
> >
> > The underlying idea is to provide new TCG helpers (a sort of softmmu
> > helper) that guarantee atomicity for some memory accesses or, more
> > generally, a way to define memory transactions.
> >
> > More specifically, the new softmmu helpers behave as LoadLink and
> > StoreConditional instructions, and are called from TCG code by means
> > of target-specific helpers. This work includes the implementation for
> > all the ARM atomic instructions; see target-arm/op_helper.c.
> >
> > The implementation heavily uses the software TLB together with a new
> > bitmap added to the ram_list structure which flags, on a per-CPU
> > basis, all the memory pages that are in the middle of a LoadLink
> > (LL) / StoreConditional (SC) operation. Since all these pages can be
> > accessed directly through the fast path and alter a vCPU's linked
> > value, the new bitmap has been coupled with a new TLB flag for the
> > TLB virtual address, which forces the slow path for all accesses to
> > a page containing a linked address.
> >
> > The new slow path is implemented such that:
> > - the LL behaves as a normal load slow path, except for clearing the
> >   dirty flag in the bitmap. While generating a TLB entry, the
> >   cputlb.c code checks whether at least one vCPU has the bit cleared
> >   in the exclusive bitmap; in that case the TLB entry will have the
> >   EXCL flag set, thus forcing the slow path. In order to ensure that
> >   all vCPUs follow the slow path for that page, we flush the TLB
> >   cache of all the other vCPUs.
> >
> >   The LL also records the linked address and the size of the access
> >   in a vCPU-private variable. After the corresponding SC, this
> >   address is set back to a reset value.
> >
> > - the SC can fail, returning 1, or succeed, returning 0. It must
> >   always come after an LL and has to access the same address 'linked'
> >   by the previous LL, otherwise it fails. If, in the time window
> >   delimited by a legitimate pair of LL/SC operations, another write
> >   access hits the linked address, the SC fails.
> >
> > In theory, the provided implementation of TCG LoadLink/StoreConditional
> > can be used to properly handle atomic instructions on any architecture.
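As a side note for readers new to the series, the LL/SC semantics described
above boil down to roughly the following minimal, self-contained C model.
All names and types here are illustrative, not the actual QEMU API (the
real helpers are generated from softmmu_llsc_template.h):

    /* Toy model of the LL/SC helper semantics; illustrative names only. */

    #include <stdint.h>
    #include <stdbool.h>
    #include <stddef.h>

    #define EXCL_RESET_ADDR UINT64_MAX  /* hypothetical "no link" marker */

    typedef struct VCPUModel {
        uint64_t excl_addr;     /* address "linked" by the last LL */
        size_t   excl_size;     /* size of the LL access */
        bool     link_valid;    /* cleared by any intervening write */
    } VCPUModel;

    /* LL: a normal load, plus bookkeeping. The real helper additionally
     * clears this page's bit in the exclusive bitmap and flushes the
     * other vCPUs' TLBs so that their stores take the slow path. */
    uint32_t model_ldlink(VCPUModel *cpu, const uint32_t *mem, uint64_t addr)
    {
        cpu->excl_addr = addr;
        cpu->excl_size = sizeof(uint32_t);
        cpu->link_valid = true;
        return mem[addr / sizeof(uint32_t)];
    }

    /* Called (conceptually) from every store slow path: a write hitting
     * the linked range breaks the link, so the next SC fails. */
    void model_notify_write(VCPUModel *cpu, uint64_t addr, size_t size)
    {
        if (addr < cpu->excl_addr + cpu->excl_size &&
            cpu->excl_addr < addr + size) {
            cpu->link_valid = false;
        }
    }

    /* SC: returns 0 on success, 1 on failure. It fails unless it targets
     * the address linked by the previous LL and no conflicting write
     * happened in between. Either way, the link is consumed. */
    uint32_t model_stcond(VCPUModel *cpu, uint32_t *mem, uint64_t addr,
                          uint32_t val)
    {
        uint32_t fail = 1;
        if (cpu->link_valid && addr == cpu->excl_addr) {
            mem[addr / sizeof(uint32_t)] = val;
            fail = 0;
        }
        cpu->excl_addr = EXCL_RESET_ADDR;
        cpu->link_valid = false;
        return fail;
    }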
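The fast-path side of the same mechanism can be sketched in the same
spirit: on TLB fill, a page that some vCPU has marked exclusive gets a
flag in its TLB entry, so the usual address comparison fails and the
store is diverted to the slow-path helper. Again, every name and constant
below is a stand-in, not the actual QEMU definition:

    /* Toy model of the TLB_EXCL mechanism; illustrative names only. */

    #include <stdint.h>
    #include <stdbool.h>

    #define MODEL_TLB_EXCL   (1ull << 3)        /* hypothetical flag bit */
    #define MODEL_PAGE_MASK  (~(uint64_t)0xfff) /* 4 KiB pages assumed */

    /* One byte per page in this toy model (the series uses a fixed
     * 8-bits-per-entry bitmap): all bits set = dirty/normal, any bit
     * cleared = some vCPU holds an exclusive link on the page. */
    bool page_has_excl_link(const uint8_t *excl_bitmap, uint64_t page)
    {
        return excl_bitmap[page] != 0xff;
    }

    /* On TLB fill: tag the entry so writes to an "exclusive" page cannot
     * complete inline. The real code also flushes the other vCPUs' TLBs
     * when a page first becomes exclusive. */
    uint64_t fill_tlb_addr(const uint8_t *excl_bitmap, uint64_t vaddr)
    {
        uint64_t tlb_addr = vaddr & MODEL_PAGE_MASK;
        if (page_has_excl_link(excl_bitmap, vaddr >> 12)) {
            tlb_addr |= MODEL_TLB_EXCL;   /* force the write slow path */
        }
        return tlb_addr;
    }

    /* Fast-path check as TCG-generated code performs it: the comparison
     * only matches when no flag bits are set, so the EXCL flag diverts
     * the access to the helper, where the link bookkeeping happens. */
    bool fast_path_hit(uint64_t tlb_addr, uint64_t vaddr)
    {
        return tlb_addr == (vaddr & MODEL_PAGE_MASK);
    }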
> > The code has been tested with bare-metal test cases and by booting Linux.
> >
> > * Performance considerations
> > The new slow path adds some overhead to the translation of the ARM
> > atomic instructions, since their emulation no longer happens only in
> > the guest (by means of pure TCG-generated code) but requires the
> > execution of two helper functions. Despite this, the additional time
> > required to boot an ARM Linux kernel on an i7 clocked at 2.5 GHz is
> > negligible. On an LL/SC-bound test scenario, however - like
> > https://git.virtualopensystems.com/dev/tcg_baremetal_tests.git - this
> > solution requires 30% (1 million iterations) to 70% (10 million
> > iterations) additional time for the test to complete.
> >
> > Changes from v5:
> > - The exclusive memory region is now set through a CPUClass hook,
> >   allowing any architecture to decide the memory area that will be
> >   protected during an LL/SC operation [PATCH 3]
> > - The runtime helpers dropped any target dependency and are now in a
> >   common file [PATCH 5]
> > - Improved the way we restore a guest page as non-exclusive [PATCH 9]
> > - Included MMIO memory as a possible target of LL/SC instructions.
> >   This also required somewhat simplifying the helper_*_st_name
> >   helpers in softmmu_template.h [PATCH 8-14]
> >
> > Changes from v4:
> > - Reworked the exclusive bitmap to be of fixed size (8 bits per address)
> > - The slow path is now TCG backend independent; no need to touch
> >   tcg/* anymore, as suggested by Aurelien Jarno
> >
> > Changes from v3:
> > - based on upstream QEMU
> > - addressed comments from Alex Bennée
> > - the slow path can be enabled by the user with
> >   ./configure --enable-tcg-ldst-excl, only if the backend supports it
> > - all the ARM ldrex/strex instructions now make use of the slow path
> > - added aarch64 TCG backend support
> > - part of the code has been rewritten
> >
> > Changes from v2:
> > - the bitmap accessors are now atomic
> > - a rendezvous between vCPUs and simple callback support before
> >   executing a TB have been added to handle the TLB flush support
> > - the softmmu_template and softmmu_llsc_template have been adapted to
> >   work with real multi-threading
> >
> > Changes from v1:
> > - The ram bitmap is no longer reversed: 1 = dirty, 0 = exclusive
> > - The way the offset to access the bitmap is calculated has been
> >   improved and fixed
> > - A page is set as dirty only when a vCPU targets the protected
> >   address itself, not just any address in the page
> > - Addressed comments from Richard Henderson to improve the logic in
> >   softmmu_template.h and to simplify the method generation through
> >   softmmu_llsc_template.h
> > - Added an initial implementation of qemu_{ldlink,stcond}_i32 for
> >   tcg/i386
> >
> > This work has been sponsored by Huawei Technologies Duesseldorf GmbH.
> > Alvise Rigo (14):
> >   exec.c: Add new exclusive bitmap to ram_list
> >   softmmu: Add new TLB_EXCL flag
> >   Add CPUClass hook to set exclusive range
> >   softmmu: Add helpers for a new slowpath
> >   tcg: Create new runtime helpers for excl accesses
> >   configure: Use slow-path for atomic only when the softmmu is enabled
> >   target-arm: translate: Use ld/st excl for atomic insns
> >   target-arm: Add atomic_clear helper for CLREX insn
> >   softmmu: Add history of excl accesses
> >   softmmu: Simplify helper_*_st_name, wrap unaligned code
> >   softmmu: Simplify helper_*_st_name, wrap MMIO code
> >   softmmu: Simplify helper_*_st_name, wrap RAM code
> >   softmmu: Include MMIO/invalid exclusive accesses
> >   softmmu: Protect MMIO exclusive range
> >
> >  Makefile.target             |   2 +-
> >  configure                   |   4 +
> >  cputlb.c                    |  67 ++++++++-
> >  exec.c                      |   8 +-
> >  include/exec/cpu-all.h      |   8 ++
> >  include/exec/cpu-defs.h     |   1 +
> >  include/exec/helper-gen.h   |   1 +
> >  include/exec/helper-proto.h |   1 +
> >  include/exec/helper-tcg.h   |   1 +
> >  include/exec/memory.h       |   4 +-
> >  include/exec/ram_addr.h     |  76 ++++++++++
> >  include/qom/cpu.h           |  21 +++
> >  qom/cpu.c                   |   7 +
> >  softmmu_llsc_template.h     | 144 +++++++++++++++++++
> >  softmmu_template.h          | 338 +++++++++++++++++++++++++++++++++-----------
> >  target-arm/helper.h         |   2 +
> >  target-arm/op_helper.c      |   6 +
> >  target-arm/translate.c      | 102 ++++++++++++-
> >  tcg-llsc-helper.c           | 109 ++++++++++++++
> >  tcg-llsc-helper.h           |  35 +++++
> >  tcg/tcg-llsc-gen-helper.h   |  32 +++++
> >  tcg/tcg.h                   |  31 ++++
> >  22 files changed, 909 insertions(+), 91 deletions(-)
> >  create mode 100644 softmmu_llsc_template.h
> >  create mode 100644 tcg-llsc-helper.c
> >  create mode 100644 tcg-llsc-helper.h
> >  create mode 100644 tcg/tcg-llsc-gen-helper.h
>
> --
> Alex Bennée
>