* [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation
@ 2015-12-14  8:41 Alvise Rigo
  2015-12-14  8:41 ` [Qemu-devel] [RFC v6 01/14] exec.c: Add new exclusive bitmap to ram_list Alvise Rigo
                   ` (17 more replies)
  0 siblings, 18 replies; 60+ messages in thread
From: Alvise Rigo @ 2015-12-14  8:41 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: claudio.fontana, pbonzini, jani.kokkonen, tech, alex.bennee, rth

This is the sixth iteration of the patch series which applies to the
upstream branch of QEMU (v2.5.0-rc3).

Changes versus previous versions are at the bottom of this cover letter.

The code is also available at the following repository:
https://git.virtualopensystems.com/dev/qemu-mt.git
branch:
slowpath-for-atomic-v6-no-mttcg

This patch series provides an infrastructure for implementing atomic
instructions in QEMU, thus offering a 'legacy' solution for
translating guest atomic instructions. Moreover, it can be considered
a first step toward a multi-threaded TCG.

The underlying idea is to provide new TCG helpers (a sort of softmmu
helpers) that guarantee atomicity for certain memory accesses or, more
generally, a way to define memory transactions.

More specifically, the new softmmu helpers behave as LoadLink and
StoreConditional instructions, and are called from TCG code by means of
target specific helpers. This work includes the implementation for all
the ARM atomic instructions, see target-arm/op_helper.c.

The implementation heavily uses the software TLB together with a new
bitmap added to the ram_list structure, which flags, on a per-vCPU
basis, all the memory pages that are in the middle of a LoadLink (LL) /
StoreConditional (SC) operation.  Since all these pages can be
accessed directly through the fast-path, altering a vCPU's linked
value, the new bitmap has been coupled with a new TLB flag on the TLB
virtual address which forces the slow-path for all accesses to a page
containing a linked address.

The new slow-path is implemented such that:
- the LL behaves as a normal load slow-path, except that it clears the
  dirty flag in the bitmap.  While generating a TLB entry, the
  cputlb.c code checks whether at least one vCPU has the bit cleared
  in the exclusive bitmap; in that case the TLB entry will have the
  EXCL flag set, thus forcing the slow-path.  To ensure that all the
  vCPUs follow the slow-path for that page, we flush the TLB cache
  of all the other vCPUs.

  The LL will also set the linked address and size of the access in a
  vCPU's private variable. After the corresponding SC, this address will
  be set to a reset value.

- the SC can fail, returning 1, or succeed, returning 0.  It must
  always come after an LL and access the same address 'linked' by the
  previous LL, otherwise it will fail. If, in the time window delimited
  by a legitimate pair of LL/SC operations, another write access to
  the linked address happens, the SC will fail.
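As a rough model, the LL/SC bookkeeping described above can be
sketched in plain C. All names and the flat mem[] array here are
invented for illustration; the real series tracks host physical
ranges and TLB flags rather than array indices:

```c
#include <assert.h>
#include <stdint.h>

#define EXCLUSIVE_RESET_ADDR UINT64_MAX
#define NCPUS 4

/* Minimal model of the per-vCPU linked address: LL records the address,
 * any plain store to that address breaks every matching link, and SC
 * succeeds only if this vCPU's link survived. */
typedef struct { uint64_t linked_addr; } VCPU;

static VCPU cpus[NCPUS];
static uint32_t mem[16];

static uint32_t ll(VCPU *cpu, uint64_t addr)
{
    cpu->linked_addr = addr;              /* set the protected address */
    return mem[addr];
}

static void plain_store(uint64_t addr, uint32_t val)
{
    mem[addr] = val;
    /* Reset every matching linked address, similar to what
     * lookup_and_reset_cpus_ll_addr() does in the series. */
    for (int i = 0; i < NCPUS; i++) {
        if (cpus[i].linked_addr == addr) {
            cpus[i].linked_addr = EXCLUSIVE_RESET_ADDR;
        }
    }
}

static int sc(VCPU *cpu, uint64_t addr, uint32_t val)
{
    if (cpu->linked_addr != addr) {
        return 1;                         /* fail: link broken or wrong address */
    }
    mem[addr] = val;
    cpu->linked_addr = EXCLUSIVE_RESET_ADDR;
    return 0;                             /* success */
}
```

In this model, a store by any vCPU between a vCPU's LL and SC makes the
SC return 1, which matches the failure semantics described above.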

In theory, the provided implementation of TCG LoadLink/StoreConditional
can be used to properly handle atomic instructions on any architecture.

The code has been tested with bare-metal test cases and by booting Linux.

* Performance considerations
The new slow-path adds some overhead to the translation of the ARM
atomic instructions, since their emulation no longer happens purely in
guest code (by means of TCG-generated code), but requires the
execution of two helper functions. Despite this, the additional time
required to boot an ARM Linux kernel on an i7 clocked at 2.5GHz is
negligible.
On an LL/SC-bound test scenario, however - like:
https://git.virtualopensystems.com/dev/tcg_baremetal_tests.git - this
solution requires 30% (1 million iterations) to 70% (10 million
iterations) additional time for the test to complete.

Changes from v5:
- The exclusive memory region is now set through a CPUClass hook,
  allowing any architecture to decide the memory area that will be
  protected during a LL/SC operation [PATCH 3]
- The runtime helpers dropped any target dependency and are now in a
  common file [PATCH 5]
- Improved the way we restore a guest page as non-exclusive [PATCH 9]
- Included MMIO memory as a possible target of LL/SC
  instructions. This also required simplifying the
  helper_*_st_name helpers in softmmu_template.h [PATCH 8-14]

Changes from v4:
- Reworked the exclusive bitmap to be of fixed size (8 bits per address)
- The slow-path is now TCG backend independent, no need to touch
  tcg/* anymore as suggested by Aurelien Jarno.

Changes from v3:
- based on upstream QEMU
- addressed comments from Alex Bennée
- the slow path can be enabled by the user with:
  ./configure --enable-tcg-ldst-excl only if the backend supports it
- all the ARM ldex/stex instructions make now use of the slow path
- added aarch64 TCG backend support
- part of the code has been rewritten

Changes from v2:
- the bitmap accessors are now atomic
- a rendezvous between vCPUs and simple callback support before
  executing a TB have been added to handle TLB flushing
- the softmmu_template and softmmu_llsc_template have been adapted to work
  with real multi-threading

Changes from v1:
- The ram bitmap is not reversed anymore, 1 = dirty, 0 = exclusive
- The way the offset used to access the bitmap is calculated has
  been improved and fixed
- Setting a page as dirty now requires a vCPU to target the protected
  address, not just any address in the page
- Addressed comments from Richard Henderson to improve the logic in
  softmmu_template.h and to simplify the methods generation through
  softmmu_llsc_template.h
- Added initial implementation of qemu_{ldlink,stcond}_i32 for tcg/i386

This work has been sponsored by Huawei Technologies Duesseldorf GmbH.

Alvise Rigo (14):
  exec.c: Add new exclusive bitmap to ram_list
  softmmu: Add new TLB_EXCL flag
  Add CPUClass hook to set exclusive range
  softmmu: Add helpers for a new slowpath
  tcg: Create new runtime helpers for excl accesses
  configure: Use slow-path for atomic only when the softmmu is enabled
  target-arm: translate: Use ld/st excl for atomic insns
  target-arm: Add atomic_clear helper for CLREX insn
  softmmu: Add history of excl accesses
  softmmu: Simplify helper_*_st_name, wrap unaligned code
  softmmu: Simplify helper_*_st_name, wrap MMIO code
  softmmu: Simplify helper_*_st_name, wrap RAM code
  softmmu: Include MMIO/invalid exclusive accesses
  softmmu: Protect MMIO exclusive range

 Makefile.target             |   2 +-
 configure                   |   4 +
 cputlb.c                    |  67 ++++++++-
 exec.c                      |   8 +-
 include/exec/cpu-all.h      |   8 ++
 include/exec/cpu-defs.h     |   1 +
 include/exec/helper-gen.h   |   1 +
 include/exec/helper-proto.h |   1 +
 include/exec/helper-tcg.h   |   1 +
 include/exec/memory.h       |   4 +-
 include/exec/ram_addr.h     |  76 ++++++++++
 include/qom/cpu.h           |  21 +++
 qom/cpu.c                   |   7 +
 softmmu_llsc_template.h     | 144 +++++++++++++++++++
 softmmu_template.h          | 338 +++++++++++++++++++++++++++++++++-----------
 target-arm/helper.h         |   2 +
 target-arm/op_helper.c      |   6 +
 target-arm/translate.c      | 102 ++++++++++++-
 tcg-llsc-helper.c           | 109 ++++++++++++++
 tcg-llsc-helper.h           |  35 +++++
 tcg/tcg-llsc-gen-helper.h   |  32 +++++
 tcg/tcg.h                   |  31 ++++
 22 files changed, 909 insertions(+), 91 deletions(-)
 create mode 100644 softmmu_llsc_template.h
 create mode 100644 tcg-llsc-helper.c
 create mode 100644 tcg-llsc-helper.h
 create mode 100644 tcg/tcg-llsc-gen-helper.h

-- 
2.6.4


* [Qemu-devel] [RFC v6 01/14] exec.c: Add new exclusive bitmap to ram_list
  2015-12-14  8:41 [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation Alvise Rigo
@ 2015-12-14  8:41 ` Alvise Rigo
  2015-12-18 13:18   ` Alex Bennée
  2015-12-14  8:41 ` [Qemu-devel] [RFC v6 02/14] softmmu: Add new TLB_EXCL flag Alvise Rigo
                   ` (16 subsequent siblings)
  17 siblings, 1 reply; 60+ messages in thread
From: Alvise Rigo @ 2015-12-14  8:41 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: claudio.fontana, pbonzini, jani.kokkonen, tech, alex.bennee, rth

The purpose of this new bitmap is to flag, on a per-vCPU basis, the
memory pages that are in the middle of LL/SC operations (after a LL,
before a SC).
For all these pages, the corresponding TLB entries will be generated
in such a way as to force the slow-path if at least one vCPU has its
bit cleared.
When the system starts, the whole memory is dirty (the whole bitmap is
set). A page, after being marked as exclusively-clean, will be
restored as dirty after the SC.

For each page we keep 8 bits, shared among all the vCPUs available in
the system. In general, vCPU n corresponds to bit n % 8.
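The bit indexing can be illustrated with a small model that mirrors
the EXCL_BITMAP_GET_BIT_OFFSET() and EXCL_IDX() macros added below,
assuming 4 KiB target pages (the page size is an assumption of this
sketch, not fixed by the patch):

```c
#include <assert.h>
#include <stdint.h>

#define TARGET_PAGE_BITS 12        /* assumed: 4 KiB target pages */
#define EXCL_BITMAP_CELL_SZ 8

/* One 8-bit cell per page; vCPU n is mapped to bit n % 8 of the cell,
 * so the absolute bit offset combines the page index and the vCPU. */
static unsigned long excl_bit_offset(uint64_t addr, unsigned cpu)
{
    return EXCL_BITMAP_CELL_SZ * (addr >> TARGET_PAGE_BITS)
           + (cpu % EXCL_BITMAP_CELL_SZ);
}
```

For example, vCPU 10 aliases onto bit 2 of its page's cell, which is
why systems with more than 8 vCPUs share bits among vCPUs.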

Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
 exec.c                  |  8 ++++--
 include/exec/memory.h   |  3 +-
 include/exec/ram_addr.h | 76 +++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 84 insertions(+), 3 deletions(-)

diff --git a/exec.c b/exec.c
index 0bf0a6e..e66d232 100644
--- a/exec.c
+++ b/exec.c
@@ -1548,11 +1548,15 @@ static ram_addr_t ram_block_add(RAMBlock *new_block, Error **errp)
         int i;
 
         /* ram_list.dirty_memory[] is protected by the iothread lock.  */
-        for (i = 0; i < DIRTY_MEMORY_NUM; i++) {
+        for (i = 0; i < DIRTY_MEMORY_EXCLUSIVE; i++) {
             ram_list.dirty_memory[i] =
                 bitmap_zero_extend(ram_list.dirty_memory[i],
                                    old_ram_size, new_ram_size);
-       }
+        }
+        ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE] = bitmap_zero_extend(
+                ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE],
+                old_ram_size * EXCL_BITMAP_CELL_SZ,
+                new_ram_size * EXCL_BITMAP_CELL_SZ);
     }
     cpu_physical_memory_set_dirty_range(new_block->offset,
                                         new_block->used_length,
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 0f07159..2782c77 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -19,7 +19,8 @@
 #define DIRTY_MEMORY_VGA       0
 #define DIRTY_MEMORY_CODE      1
 #define DIRTY_MEMORY_MIGRATION 2
-#define DIRTY_MEMORY_NUM       3        /* num of dirty bits */
+#define DIRTY_MEMORY_EXCLUSIVE 3
+#define DIRTY_MEMORY_NUM       4        /* num of dirty bits */
 
 #include <stdint.h>
 #include <stdbool.h>
diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
index 7115154..b48af27 100644
--- a/include/exec/ram_addr.h
+++ b/include/exec/ram_addr.h
@@ -21,6 +21,7 @@
 
 #ifndef CONFIG_USER_ONLY
 #include "hw/xen/xen.h"
+#include "sysemu/sysemu.h"
 
 struct RAMBlock {
     struct rcu_head rcu;
@@ -82,6 +83,13 @@ int qemu_ram_resize(ram_addr_t base, ram_addr_t newsize, Error **errp);
 #define DIRTY_CLIENTS_ALL     ((1 << DIRTY_MEMORY_NUM) - 1)
 #define DIRTY_CLIENTS_NOCODE  (DIRTY_CLIENTS_ALL & ~(1 << DIRTY_MEMORY_CODE))
 
+/* Exclusive bitmap support. */
+#define EXCL_BITMAP_CELL_SZ 8
+#define EXCL_BITMAP_GET_BIT_OFFSET(addr) \
+        (EXCL_BITMAP_CELL_SZ * (addr >> TARGET_PAGE_BITS))
+#define EXCL_BITMAP_GET_BYTE_OFFSET(addr) (addr >> TARGET_PAGE_BITS)
+#define EXCL_IDX(cpu) (cpu % EXCL_BITMAP_CELL_SZ)
+
 static inline bool cpu_physical_memory_get_dirty(ram_addr_t start,
                                                  ram_addr_t length,
                                                  unsigned client)
@@ -173,6 +181,11 @@ static inline void cpu_physical_memory_set_dirty_range(ram_addr_t start,
     if (unlikely(mask & (1 << DIRTY_MEMORY_CODE))) {
         bitmap_set_atomic(d[DIRTY_MEMORY_CODE], page, end - page);
     }
+    if (unlikely(mask & (1 << DIRTY_MEMORY_EXCLUSIVE))) {
+        bitmap_set_atomic(d[DIRTY_MEMORY_EXCLUSIVE],
+                        page * EXCL_BITMAP_CELL_SZ,
+                        (end - page) * EXCL_BITMAP_CELL_SZ);
+    }
     xen_modified_memory(start, length);
 }
 
@@ -288,5 +301,68 @@ uint64_t cpu_physical_memory_sync_dirty_bitmap(unsigned long *dest,
 }
 
 void migration_bitmap_extend(ram_addr_t old, ram_addr_t new);
+
+/* One cell for each page. The n-th bit of a cell describes all the i-th vCPUs
+ * such that (i % EXCL_BITMAP_CELL_SZ) == n.
+ * A bit set to zero ensures that all the vCPUs described by the bit have the
+ * EXCL_BIT set for the page. */
+static inline void cpu_physical_memory_unset_excl(ram_addr_t addr, uint32_t cpu)
+{
+    set_bit_atomic(EXCL_BITMAP_GET_BIT_OFFSET(addr) + EXCL_IDX(cpu),
+            ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE]);
+}
+
+/* Return true if there is at least one cpu with the EXCL bit set for the page
+ * of @addr. */
+static inline int cpu_physical_memory_atleast_one_excl(ram_addr_t addr)
+{
+    uint8_t *bitmap;
+
+    bitmap = (uint8_t *)(ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE]);
+
+    /* This is safe even if smp_cpus < 8 since the unused bits are always 1. */
+    return bitmap[EXCL_BITMAP_GET_BYTE_OFFSET(addr)] != UCHAR_MAX;
+}
+
+/* Return true if the @cpu has the bit set (not exclusive) for the page of
+ * @addr.  If @cpu == smp_cpus return true if at least one vCPU has the dirty
+ * bit set for that page. */
+static inline int cpu_physical_memory_not_excl(ram_addr_t addr,
+                                               unsigned long cpu)
+{
+    uint8_t *bitmap;
+
+    bitmap = (uint8_t *)ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE];
+
+    if (cpu == smp_cpus) {
+        if (smp_cpus >= EXCL_BITMAP_CELL_SZ) {
+            return bitmap[EXCL_BITMAP_GET_BYTE_OFFSET(addr)];
+        } else {
+            return bitmap[EXCL_BITMAP_GET_BYTE_OFFSET(addr)] &
+                                            ((1 << smp_cpus) - 1);
+        }
+    } else {
+        return bitmap[EXCL_BITMAP_GET_BYTE_OFFSET(addr)] & (1 << EXCL_IDX(cpu));
+    }
+}
+
+/* Clean the dirty bit of @cpu (i.e. set the page as exclusive). If @cpu ==
+ * smp_cpus clean the dirty bit for all the vCPUs. */
+static inline int cpu_physical_memory_set_excl(ram_addr_t addr, uint32_t cpu)
+{
+    if (cpu == smp_cpus) {
+        int nr = (smp_cpus >= EXCL_BITMAP_CELL_SZ) ?
+                            EXCL_BITMAP_CELL_SZ : smp_cpus;
+
+        return bitmap_test_and_clear_atomic(
+                        ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE],
+                        EXCL_BITMAP_GET_BIT_OFFSET(addr), nr);
+    } else {
+        return bitmap_test_and_clear_atomic(
+                        ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE],
+                        EXCL_BITMAP_GET_BIT_OFFSET(addr) + EXCL_IDX(cpu), 1);
+    }
+}
+
 #endif
 #endif
-- 
2.6.4


* [Qemu-devel]  [RFC v6 02/14] softmmu: Add new TLB_EXCL flag
  2015-12-14  8:41 [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation Alvise Rigo
  2015-12-14  8:41 ` [Qemu-devel] [RFC v6 01/14] exec.c: Add new exclusive bitmap to ram_list Alvise Rigo
@ 2015-12-14  8:41 ` Alvise Rigo
  2016-01-05 16:10   ` Alex Bennée
  2015-12-14  8:41 ` [Qemu-devel] [RFC v6 03/14] Add CPUClass hook to set exclusive range Alvise Rigo
                   ` (15 subsequent siblings)
  17 siblings, 1 reply; 60+ messages in thread
From: Alvise Rigo @ 2015-12-14  8:41 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: claudio.fontana, pbonzini, jani.kokkonen, tech, alex.bennee, rth

Add a new TLB flag to force all the accesses made to a page to follow
the slow-path.

When we remove a TLB entry marked as EXCL, we unset the
corresponding exclusive bit in the bitmap.
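The mechanism relies on the fact that any flag bit set in the sub-page
part of a TLB address diverts the access off the fast-path. A toy
version of that check, with the flag values from this patch and an
assumed 4 KiB page size:

```c
#include <assert.h>
#include <stdint.h>

#define TARGET_PAGE_BITS 12                        /* assumed page size */
#define TARGET_PAGE_MASK (~(uint64_t)0 << TARGET_PAGE_BITS)
#define TLB_NOTDIRTY (1 << 4)
#define TLB_MMIO     (1 << 5)
#define TLB_EXCL     (1 << 6)                      /* the new flag */

/* Any flag bit in the low (sub-page) part of addr_write forces the
 * slow path; this is why TLB_EXCL must fit below TARGET_PAGE_SIZE,
 * as the #error guard in the patch enforces. */
static int needs_slow_path(uint64_t addr_write)
{
    return (addr_write & ~TARGET_PAGE_MASK) != 0;
}
```

Setting TLB_EXCL on a page's addr_write thus makes every store to that
page take the slow path, where the exclusive bookkeeping happens.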

Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
 cputlb.c                |  38 +++++++++++++++-
 include/exec/cpu-all.h  |   8 ++++
 include/exec/cpu-defs.h |   1 +
 include/qom/cpu.h       |  14 ++++++
 softmmu_template.h      | 114 ++++++++++++++++++++++++++++++++++++++----------
 5 files changed, 152 insertions(+), 23 deletions(-)

diff --git a/cputlb.c b/cputlb.c
index bf1d50a..7ee0c89 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -394,6 +394,16 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr,
     env->tlb_v_table[mmu_idx][vidx] = *te;
     env->iotlb_v[mmu_idx][vidx] = env->iotlb[mmu_idx][index];
 
+    if (unlikely(!(te->addr_write & TLB_MMIO) && (te->addr_write & TLB_EXCL))) {
+        /* We are removing an exclusive entry; set the page to dirty. This
+         * is not necessary if the vCPU has performed both LL and SC. */
+        hwaddr hw_addr = (env->iotlb[mmu_idx][index].addr & TARGET_PAGE_MASK) +
+                                          (te->addr_write & TARGET_PAGE_MASK);
+        if (!cpu->ll_sc_context) {
+            cpu_physical_memory_unset_excl(hw_addr, cpu->cpu_index);
+        }
+    }
+
     /* refill the tlb */
     env->iotlb[mmu_idx][index].addr = iotlb - vaddr;
     env->iotlb[mmu_idx][index].attrs = attrs;
@@ -419,7 +429,15 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr,
                                                    + xlat)) {
             te->addr_write = address | TLB_NOTDIRTY;
         } else {
-            te->addr_write = address;
+            if (!(address & TLB_MMIO) &&
+                cpu_physical_memory_atleast_one_excl(section->mr->ram_addr
+                                                           + xlat)) {
+                /* There is at least one vCPU that has flagged the address as
+                 * exclusive. */
+                te->addr_write = address | TLB_EXCL;
+            } else {
+                te->addr_write = address;
+            }
         }
     } else {
         te->addr_write = -1;
@@ -471,6 +489,24 @@ tb_page_addr_t get_page_addr_code(CPUArchState *env1, target_ulong addr)
     return qemu_ram_addr_from_host_nofail(p);
 }
 
+/* For every vCPU compare the exclusive address and reset it in case of a
+ * match. Since only one vCPU is running at once, no lock has to be held to
+ * guard this operation. */
+static inline void lookup_and_reset_cpus_ll_addr(hwaddr addr, hwaddr size)
+{
+    CPUState *cpu;
+
+    CPU_FOREACH(cpu) {
+        if (cpu->excl_protected_range.begin != EXCLUSIVE_RESET_ADDR &&
+            ranges_overlap(cpu->excl_protected_range.begin,
+                           cpu->excl_protected_range.end -
+                           cpu->excl_protected_range.begin,
+                           addr, size)) {
+            cpu->excl_protected_range.begin = EXCLUSIVE_RESET_ADDR;
+        }
+    }
+}
+
 #define MMUSUFFIX _mmu
 
 #define SHIFT 0
diff --git a/include/exec/cpu-all.h b/include/exec/cpu-all.h
index 83b1781..f8d8feb 100644
--- a/include/exec/cpu-all.h
+++ b/include/exec/cpu-all.h
@@ -277,6 +277,14 @@ CPUArchState *cpu_copy(CPUArchState *env);
 #define TLB_NOTDIRTY    (1 << 4)
 /* Set if TLB entry is an IO callback.  */
 #define TLB_MMIO        (1 << 5)
+/* Set if TLB entry references a page that requires exclusive access.  */
+#define TLB_EXCL        (1 << 6)
+
+/* Do not allow a TARGET_PAGE_MASK which covers one or more bits defined
+ * above. */
+#if TLB_EXCL >= TARGET_PAGE_SIZE
+#error TARGET_PAGE_MASK covering the low bits of the TLB virtual address
+#endif
 
 void dump_exec_info(FILE *f, fprintf_function cpu_fprintf);
 void dump_opcount_info(FILE *f, fprintf_function cpu_fprintf);
diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
index 5093be2..b34d7ae 100644
--- a/include/exec/cpu-defs.h
+++ b/include/exec/cpu-defs.h
@@ -27,6 +27,7 @@
 #include <inttypes.h>
 #include "qemu/osdep.h"
 #include "qemu/queue.h"
+#include "qemu/range.h"
 #include "tcg-target.h"
 #ifndef CONFIG_USER_ONLY
 #include "exec/hwaddr.h"
diff --git a/include/qom/cpu.h b/include/qom/cpu.h
index 51a1323..c6bb6b6 100644
--- a/include/qom/cpu.h
+++ b/include/qom/cpu.h
@@ -29,6 +29,7 @@
 #include "qemu/queue.h"
 #include "qemu/thread.h"
 #include "qemu/typedefs.h"
+#include "qemu/range.h"
 
 typedef int (*WriteCoreDumpFunction)(const void *buf, size_t size,
                                      void *opaque);
@@ -210,6 +211,9 @@ struct kvm_run;
 #define TB_JMP_CACHE_BITS 12
 #define TB_JMP_CACHE_SIZE (1 << TB_JMP_CACHE_BITS)
 
+/* Atomic insn translation TLB support. */
+#define EXCLUSIVE_RESET_ADDR ULLONG_MAX
+
 /**
  * CPUState:
  * @cpu_index: CPU index (informative).
@@ -329,6 +333,16 @@ struct CPUState {
      */
     bool throttle_thread_scheduled;
 
+    /* Used by the atomic insn translation backend. */
+    int ll_sc_context;
+    /* vCPU current exclusive address range.
+     * The address is set to EXCLUSIVE_RESET_ADDR if the vCPU is not
+     * in the middle of a LL/SC. */
+    struct Range excl_protected_range;
+    /* Used to carry the SC result but also to flag a normal (legacy)
+     * store access made by a stcond (see softmmu_template.h). */
+    int excl_succeeded;
+
     /* Note that this is accessed at the start of every TB via a negative
        offset from AREG0.  Leave this field at the end so as to make the
        (absolute value) offset as small as possible.  This reduces code
diff --git a/softmmu_template.h b/softmmu_template.h
index 6803890..24d29b7 100644
--- a/softmmu_template.h
+++ b/softmmu_template.h
@@ -395,19 +395,54 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
         tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
     }
 
-    /* Handle an IO access.  */
+    /* Handle an IO access or exclusive access.  */
     if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
-        CPUIOTLBEntry *iotlbentry;
-        if ((addr & (DATA_SIZE - 1)) != 0) {
-            goto do_unaligned_access;
+        CPUIOTLBEntry *iotlbentry = &env->iotlb[mmu_idx][index];
+
+        if ((tlb_addr & ~TARGET_PAGE_MASK) == TLB_EXCL) {
+            CPUState *cpu = ENV_GET_CPU(env);
+            /* The slow-path has been forced since we are writing to
+             * exclusive-protected memory. */
+            hwaddr hw_addr = (iotlbentry->addr & TARGET_PAGE_MASK) + addr;
+
+            /* The function lookup_and_reset_cpus_ll_addr could have reset the
+             * exclusive address. Fail the SC in this case.
+             * N.B.: Here excl_succeeded == 0 means that helper_le_st_name has
+             * not been called from softmmu_llsc_template.h. */
+            if (cpu->excl_succeeded) {
+                if (cpu->excl_protected_range.begin != hw_addr) {
+                    /* The vCPU is SC-ing to an unprotected address. */
+                    cpu->excl_protected_range.begin = EXCLUSIVE_RESET_ADDR;
+                    cpu->excl_succeeded = 0;
+
+                    return;
+                }
+
+                cpu_physical_memory_unset_excl(hw_addr, cpu->cpu_index);
+            }
+
+            haddr = addr + env->tlb_table[mmu_idx][index].addend;
+        #if DATA_SIZE == 1
+            glue(glue(st, SUFFIX), _p)((uint8_t *)haddr, val);
+        #else
+            glue(glue(st, SUFFIX), _le_p)((uint8_t *)haddr, val);
+        #endif
+
+            lookup_and_reset_cpus_ll_addr(hw_addr, DATA_SIZE);
+
+            return;
+        } else {
+            if ((addr & (DATA_SIZE - 1)) != 0) {
+                goto do_unaligned_access;
+            }
+            iotlbentry = &env->iotlb[mmu_idx][index];
+
+            /* ??? Note that the io helpers always read data in the target
+               byte ordering.  We should push the LE/BE request down into io.  */
+            val = TGT_LE(val);
+            glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
+            return;
         }
-        iotlbentry = &env->iotlb[mmu_idx][index];
-
-        /* ??? Note that the io helpers always read data in the target
-           byte ordering.  We should push the LE/BE request down into io.  */
-        val = TGT_LE(val);
-        glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
-        return;
     }
 
     /* Handle slow unaligned access (it spans two pages or IO).  */
@@ -475,19 +510,54 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
         tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
     }
 
-    /* Handle an IO access.  */
+    /* Handle an IO access or exclusive access.  */
     if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
-        CPUIOTLBEntry *iotlbentry;
-        if ((addr & (DATA_SIZE - 1)) != 0) {
-            goto do_unaligned_access;
+        CPUIOTLBEntry *iotlbentry = &env->iotlb[mmu_idx][index];
+
+        if ((tlb_addr & ~TARGET_PAGE_MASK) == TLB_EXCL) {
+            CPUState *cpu = ENV_GET_CPU(env);
+            /* The slow-path has been forced since we are writing to
+             * exclusive-protected memory. */
+            hwaddr hw_addr = (iotlbentry->addr & TARGET_PAGE_MASK) + addr;
+
+            /* The function lookup_and_reset_cpus_ll_addr could have reset the
+             * exclusive address. Fail the SC in this case.
+             * N.B.: Here excl_succeeded == 0 means that helper_be_st_name has
+             * not been called from softmmu_llsc_template.h. */
+            if (cpu->excl_succeeded) {
+                if (cpu->excl_protected_range.begin != hw_addr) {
+                    /* The vCPU is SC-ing to an unprotected address. */
+                    cpu->excl_protected_range.begin = EXCLUSIVE_RESET_ADDR;
+                    cpu->excl_succeeded = 0;
+
+                    return;
+                }
+
+                cpu_physical_memory_unset_excl(hw_addr, cpu->cpu_index);
+            }
+
+            haddr = addr + env->tlb_table[mmu_idx][index].addend;
+        #if DATA_SIZE == 1
+            glue(glue(st, SUFFIX), _p)((uint8_t *)haddr, val);
+        #else
+            glue(glue(st, SUFFIX), _le_p)((uint8_t *)haddr, val);
+        #endif
+
+            lookup_and_reset_cpus_ll_addr(hw_addr, DATA_SIZE);
+
+            return;
+        } else {
+            if ((addr & (DATA_SIZE - 1)) != 0) {
+                goto do_unaligned_access;
+            }
+            iotlbentry = &env->iotlb[mmu_idx][index];
+
+            /* ??? Note that the io helpers always read data in the target
+               byte ordering.  We should push the LE/BE request down into io.  */
+            val = TGT_BE(val);
+            glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
+            return;
         }
-        iotlbentry = &env->iotlb[mmu_idx][index];
-
-        /* ??? Note that the io helpers always read data in the target
-           byte ordering.  We should push the LE/BE request down into io.  */
-        val = TGT_BE(val);
-        glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
-        return;
     }
 
     /* Handle slow unaligned access (it spans two pages or IO).  */
-- 
2.6.4


* [Qemu-devel] [RFC v6 03/14] Add CPUClass hook to set exclusive range
  2015-12-14  8:41 [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation Alvise Rigo
  2015-12-14  8:41 ` [Qemu-devel] [RFC v6 01/14] exec.c: Add new exclusive bitmap to ram_list Alvise Rigo
  2015-12-14  8:41 ` [Qemu-devel] [RFC v6 02/14] softmmu: Add new TLB_EXCL flag Alvise Rigo
@ 2015-12-14  8:41 ` Alvise Rigo
  2016-01-05 16:42   ` Alex Bennée
  2015-12-14  8:41 ` [Qemu-devel] [RFC v6 04/14] softmmu: Add helpers for a new slowpath Alvise Rigo
                   ` (14 subsequent siblings)
  17 siblings, 1 reply; 60+ messages in thread
From: Alvise Rigo @ 2015-12-14  8:41 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: claudio.fontana, pbonzini, jani.kokkonen, tech, alex.bennee, rth

Allow each architecture to set the exclusive range at any LoadLink
operation through a CPUClass hook.
This comes in handy to emulate, for instance, the exclusive monitor
implemented in some ARM architectures (more precisely, the Exclusive
Reservation Granule).
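As a sketch of why the hook is useful, a target could widen the
protected range from the exact access to a whole reservation granule.
The default hook below matches cpu_common_set_excl_range() from this
patch; the ARM-style override and its 64-byte granule are assumptions
for illustration (real CPUs define the ERG size per implementation):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t hwaddr;
typedef struct { hwaddr begin, end; } Range;
typedef struct { Range excl_protected_range; } CPUStateModel;

/* Default hook: protect exactly [addr, addr + size). */
static void common_set_excl(CPUStateModel *cpu, hwaddr addr, hwaddr size)
{
    cpu->excl_protected_range.begin = addr;
    cpu->excl_protected_range.end = addr + size;
}

/* Hypothetical ARM-style override: round the range out to a 64-byte
 * Exclusive Reservation Granule. */
static void erg_set_excl(CPUStateModel *cpu, hwaddr addr, hwaddr size)
{
    const hwaddr erg = 64;
    cpu->excl_protected_range.begin = addr & ~(erg - 1);
    cpu->excl_protected_range.end = (addr + size + erg - 1) & ~(erg - 1);
}
```

A target that installs the second variant would then fail an SC when
any store hits the same granule, not just the exact linked bytes.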

Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
 include/qom/cpu.h | 4 ++++
 qom/cpu.c         | 7 +++++++
 2 files changed, 11 insertions(+)

diff --git a/include/qom/cpu.h b/include/qom/cpu.h
index c6bb6b6..9e409ce 100644
--- a/include/qom/cpu.h
+++ b/include/qom/cpu.h
@@ -175,6 +175,10 @@ typedef struct CPUClass {
     void (*cpu_exec_exit)(CPUState *cpu);
     bool (*cpu_exec_interrupt)(CPUState *cpu, int interrupt_request);
 
+    /* Atomic instruction handling */
+    void (*cpu_set_excl_protected_range)(CPUState *cpu, hwaddr addr,
+                                         hwaddr size);
+
     void (*disas_set_info)(CPUState *cpu, disassemble_info *info);
 } CPUClass;
 
diff --git a/qom/cpu.c b/qom/cpu.c
index fb80d13..a5c25a8 100644
--- a/qom/cpu.c
+++ b/qom/cpu.c
@@ -203,6 +203,12 @@ static bool cpu_common_exec_interrupt(CPUState *cpu, int int_req)
     return false;
 }
 
+static void cpu_common_set_excl_range(CPUState *cpu, hwaddr addr, hwaddr size)
+{
+    cpu->excl_protected_range.begin = addr;
+    cpu->excl_protected_range.end = addr + size;
+}
+
 void cpu_dump_state(CPUState *cpu, FILE *f, fprintf_function cpu_fprintf,
                     int flags)
 {
@@ -355,6 +361,7 @@ static void cpu_class_init(ObjectClass *klass, void *data)
     k->cpu_exec_enter = cpu_common_noop;
     k->cpu_exec_exit = cpu_common_noop;
     k->cpu_exec_interrupt = cpu_common_exec_interrupt;
+    k->cpu_set_excl_protected_range = cpu_common_set_excl_range;
     dc->realize = cpu_common_realizefn;
     /*
      * Reason: CPUs still need special care by board code: wiring up
-- 
2.6.4


* [Qemu-devel] [RFC v6 04/14] softmmu: Add helpers for a new slowpath
  2015-12-14  8:41 [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation Alvise Rigo
                   ` (2 preceding siblings ...)
  2015-12-14  8:41 ` [Qemu-devel] [RFC v6 03/14] Add CPUClass hook to set exclusive range Alvise Rigo
@ 2015-12-14  8:41 ` Alvise Rigo
  2016-01-06 15:16   ` Alex Bennée
  2015-12-14  8:41 ` [Qemu-devel] [RFC v6 05/14] tcg: Create new runtime helpers for excl accesses Alvise Rigo
                   ` (13 subsequent siblings)
  17 siblings, 1 reply; 60+ messages in thread
From: Alvise Rigo @ 2015-12-14  8:41 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: claudio.fontana, pbonzini, jani.kokkonen, tech, alex.bennee, rth

The new helpers rely on the legacy ones to perform the actual read/write.

The LoadLink helper (helper_ldlink_name) prepares the way for the
following SC operation. It sets the linked address and the size of the
access.
These helpers also update the TLB entry of the page involved in the
LL/SC for those vCPUs that have the bit set (dirty), so that the
following accesses made by all the vCPUs will follow the slow path.

The StoreConditional helper (helper_stcond_name) returns 1 if the
store has to fail due to a concurrent access to the same page by
another vCPU. A 'concurrent access' can be a store made by *any* vCPU
(although some implementations allow stores made by the CPU that issued
the LoadLink).
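The LoadLink-side bookkeeping just described can be modeled as below.
The function name, the flush_requested[] array, and the page-indexed
bitmap are invented for this model; the real helpers operate on
ram_addr offsets and queue actual TLB flushes:

```c
#include <assert.h>
#include <stdint.h>

#define NCPUS 4

static uint8_t excl_bitmap[16];   /* one 8-bit cell per page */
static int flush_requested[NCPUS];

/* Sketch of the LL bookkeeping: clear this vCPU's dirty bit for the
 * page, then request a TLB flush from every vCPU whose bit is still
 * set, so that its next TLB refill observes the cleared bit and
 * installs a TLB_EXCL entry, forcing the slow path. */
static void llsc_ll_bookkeeping(unsigned this_cpu, unsigned page)
{
    excl_bitmap[page] &= ~(1u << (this_cpu % 8));
    for (unsigned c = 0; c < NCPUS; c++) {
        if (c != this_cpu && (excl_bitmap[page] & (1u << (c % 8)))) {
            flush_requested[c] = 1;
        }
    }
}
```

After this step, every other vCPU's next store to the page goes
through the slow path, which is where a conflicting access can make
the pending SC fail.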

Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
 cputlb.c                |   3 ++
 softmmu_llsc_template.h | 134 ++++++++++++++++++++++++++++++++++++++++++++++++
 softmmu_template.h      |  12 +++++
 tcg/tcg.h               |  31 +++++++++++
 4 files changed, 180 insertions(+)
 create mode 100644 softmmu_llsc_template.h

diff --git a/cputlb.c b/cputlb.c
index 7ee0c89..70b6404 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -509,6 +509,8 @@ static inline void lookup_and_reset_cpus_ll_addr(hwaddr addr, hwaddr size)
 
 #define MMUSUFFIX _mmu
 
+/* Generates LoadLink/StoreConditional helpers in softmmu_template.h */
+#define GEN_EXCLUSIVE_HELPERS
 #define SHIFT 0
 #include "softmmu_template.h"
 
@@ -521,6 +523,7 @@ static inline void lookup_and_reset_cpus_ll_addr(hwaddr addr, hwaddr size)
 #define SHIFT 3
 #include "softmmu_template.h"
 #undef MMUSUFFIX
+#undef GEN_EXCLUSIVE_HELPERS
 
 #define MMUSUFFIX _cmmu
 #undef GETPC_ADJ
diff --git a/softmmu_llsc_template.h b/softmmu_llsc_template.h
new file mode 100644
index 0000000..586bb2e
--- /dev/null
+++ b/softmmu_llsc_template.h
@@ -0,0 +1,134 @@
+/*
+ *  Software MMU support (exclusive load/store operations)
+ *
+ * Generate helpers used by TCG for qemu_ldlink/stcond ops.
+ *
+ * Included from softmmu_template.h only.
+ *
+ * Copyright (c) 2015 Virtual Open Systems
+ *
+ * Authors:
+ *  Alvise Rigo <a.rigo@virtualopensystems.com>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+/* This template does not generate the LE and BE versions together; it emits
+ * only one of the two, depending on whether BIGENDIAN_EXCLUSIVE_HELPERS has
+ * been defined. The exclusive helpers follow the same nomenclature as
+ * softmmu_template.h.  */
+
+#ifdef BIGENDIAN_EXCLUSIVE_HELPERS
+
+#define helper_ldlink_name  glue(glue(helper_be_ldlink, USUFFIX), MMUSUFFIX)
+#define helper_stcond_name  glue(glue(helper_be_stcond, SUFFIX), MMUSUFFIX)
+#define helper_ld glue(glue(helper_be_ld, USUFFIX), MMUSUFFIX)
+#define helper_st glue(glue(helper_be_st, SUFFIX), MMUSUFFIX)
+
+#else /* LE helpers + 8bit helpers (generated only once for both LE and BE) */
+
+#if DATA_SIZE > 1
+#define helper_ldlink_name  glue(glue(helper_le_ldlink, USUFFIX), MMUSUFFIX)
+#define helper_stcond_name  glue(glue(helper_le_stcond, SUFFIX), MMUSUFFIX)
+#define helper_ld glue(glue(helper_le_ld, USUFFIX), MMUSUFFIX)
+#define helper_st glue(glue(helper_le_st, SUFFIX), MMUSUFFIX)
+#else /* DATA_SIZE <= 1 */
+#define helper_ldlink_name  glue(glue(helper_ret_ldlink, USUFFIX), MMUSUFFIX)
+#define helper_stcond_name  glue(glue(helper_ret_stcond, SUFFIX), MMUSUFFIX)
+#define helper_ld glue(glue(helper_ret_ld, USUFFIX), MMUSUFFIX)
+#define helper_st glue(glue(helper_ret_st, SUFFIX), MMUSUFFIX)
+#endif
+
+#endif
+
+WORD_TYPE helper_ldlink_name(CPUArchState *env, target_ulong addr,
+                                TCGMemOpIdx oi, uintptr_t retaddr)
+{
+    WORD_TYPE ret;
+    int index;
+    CPUState *cpu, *this = ENV_GET_CPU(env);
+    CPUClass *cc = CPU_GET_CLASS(this);
+    hwaddr hw_addr;
+    unsigned mmu_idx = get_mmuidx(oi);
+
+    /* Use the proper load helper from cpu_ldst.h */
+    ret = helper_ld(env, addr, mmu_idx, retaddr);
+
+    index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
+
+    /* hw_addr = hwaddr of the page (i.e. section->mr->ram_addr + xlat)
+     * plus the offset (i.e. addr & ~TARGET_PAGE_MASK) */
+    hw_addr = (env->iotlb[mmu_idx][index].addr & TARGET_PAGE_MASK) + addr;
+
+    cpu_physical_memory_set_excl(hw_addr, this->cpu_index);
+    /* If all the vCPUs have the EXCL bit set for this page there is no need
+     * to request any flush. */
+    if (cpu_physical_memory_not_excl(hw_addr, smp_cpus)) {
+        CPU_FOREACH(cpu) {
+            if (current_cpu != cpu) {
+                if (cpu_physical_memory_not_excl(hw_addr, cpu->cpu_index)) {
+                    cpu_physical_memory_set_excl(hw_addr, cpu->cpu_index);
+                    tlb_flush(cpu, 1);
+                }
+            }
+        }
+    }
+
+    cc->cpu_set_excl_protected_range(this, hw_addr, DATA_SIZE);
+
+    /* For this vCPU, just update the TLB entry, no need to flush. */
+    env->tlb_table[mmu_idx][index].addr_write |= TLB_EXCL;
+
+    /* From now on we are in LL/SC context */
+    this->ll_sc_context = 1;
+
+    return ret;
+}
+
+WORD_TYPE helper_stcond_name(CPUArchState *env, target_ulong addr,
+                             DATA_TYPE val, TCGMemOpIdx oi,
+                             uintptr_t retaddr)
+{
+    WORD_TYPE ret;
+    unsigned mmu_idx = get_mmuidx(oi);
+    CPUState *cpu = ENV_GET_CPU(env);
+
+    if (!cpu->ll_sc_context) {
+        cpu->excl_succeeded = 0;
+        ret = 1;
+    } else {
+        /* Set to one in advance so that the upcoming legacy access is
+         * recognized as made by the store conditional wrapper; it is
+         * cleared if the store conditional does not succeed. */
+        cpu->excl_succeeded = 1;
+        helper_st(env, addr, val, mmu_idx, retaddr);
+
+        if (cpu->excl_succeeded) {
+            cpu->excl_succeeded = 0;
+            ret = 0;
+        } else {
+            ret = 1;
+        }
+
+        /* Unset LL/SC context */
+        cpu->ll_sc_context = 0;
+    }
+
+    return ret;
+}
+
+#undef helper_ldlink_name
+#undef helper_stcond_name
+#undef helper_ld
+#undef helper_st
diff --git a/softmmu_template.h b/softmmu_template.h
index 24d29b7..d3d5902 100644
--- a/softmmu_template.h
+++ b/softmmu_template.h
@@ -620,6 +620,18 @@ void probe_write(CPUArchState *env, target_ulong addr, int mmu_idx,
 #endif
 #endif /* !defined(SOFTMMU_CODE_ACCESS) */
 
+#ifdef GEN_EXCLUSIVE_HELPERS
+
+#if DATA_SIZE > 1 /* The 8-bit helpers are generated along with LE helpers */
+#define BIGENDIAN_EXCLUSIVE_HELPERS
+#include "softmmu_llsc_template.h"
+#undef BIGENDIAN_EXCLUSIVE_HELPERS
+#endif
+
+#include "softmmu_llsc_template.h"
+
+#endif /* GEN_EXCLUSIVE_HELPERS */
+
 #undef READ_ACCESS_TYPE
 #undef SHIFT
 #undef DATA_TYPE
diff --git a/tcg/tcg.h b/tcg/tcg.h
index a696922..3e050a4 100644
--- a/tcg/tcg.h
+++ b/tcg/tcg.h
@@ -968,6 +968,21 @@ tcg_target_ulong helper_be_ldul_mmu(CPUArchState *env, target_ulong addr,
                                     TCGMemOpIdx oi, uintptr_t retaddr);
 uint64_t helper_be_ldq_mmu(CPUArchState *env, target_ulong addr,
                            TCGMemOpIdx oi, uintptr_t retaddr);
+/* Exclusive variants */
+tcg_target_ulong helper_ret_ldlinkub_mmu(CPUArchState *env, target_ulong addr,
+                                            TCGMemOpIdx oi, uintptr_t retaddr);
+tcg_target_ulong helper_le_ldlinkuw_mmu(CPUArchState *env, target_ulong addr,
+                                            TCGMemOpIdx oi, uintptr_t retaddr);
+tcg_target_ulong helper_le_ldlinkul_mmu(CPUArchState *env, target_ulong addr,
+                                            TCGMemOpIdx oi, uintptr_t retaddr);
+uint64_t helper_le_ldlinkq_mmu(CPUArchState *env, target_ulong addr,
+                                            TCGMemOpIdx oi, uintptr_t retaddr);
+tcg_target_ulong helper_be_ldlinkuw_mmu(CPUArchState *env, target_ulong addr,
+                                            TCGMemOpIdx oi, uintptr_t retaddr);
+tcg_target_ulong helper_be_ldlinkul_mmu(CPUArchState *env, target_ulong addr,
+                                            TCGMemOpIdx oi, uintptr_t retaddr);
+uint64_t helper_be_ldlinkq_mmu(CPUArchState *env, target_ulong addr,
+                                            TCGMemOpIdx oi, uintptr_t retaddr);
 
 /* Value sign-extended to tcg register size.  */
 tcg_target_ulong helper_ret_ldsb_mmu(CPUArchState *env, target_ulong addr,
@@ -1010,6 +1025,22 @@ uint32_t helper_be_ldl_cmmu(CPUArchState *env, target_ulong addr,
                             TCGMemOpIdx oi, uintptr_t retaddr);
 uint64_t helper_be_ldq_cmmu(CPUArchState *env, target_ulong addr,
                             TCGMemOpIdx oi, uintptr_t retaddr);
+/* Exclusive variants */
+tcg_target_ulong helper_ret_stcondb_mmu(CPUArchState *env, target_ulong addr,
+                            uint8_t val, TCGMemOpIdx oi, uintptr_t retaddr);
+tcg_target_ulong helper_le_stcondw_mmu(CPUArchState *env, target_ulong addr,
+                            uint16_t val, TCGMemOpIdx oi, uintptr_t retaddr);
+tcg_target_ulong helper_le_stcondl_mmu(CPUArchState *env, target_ulong addr,
+                            uint32_t val, TCGMemOpIdx oi, uintptr_t retaddr);
+uint64_t helper_le_stcondq_mmu(CPUArchState *env, target_ulong addr,
+                            uint64_t val, TCGMemOpIdx oi, uintptr_t retaddr);
+tcg_target_ulong helper_be_stcondw_mmu(CPUArchState *env, target_ulong addr,
+                            uint16_t val, TCGMemOpIdx oi, uintptr_t retaddr);
+tcg_target_ulong helper_be_stcondl_mmu(CPUArchState *env, target_ulong addr,
+                            uint32_t val, TCGMemOpIdx oi, uintptr_t retaddr);
+uint64_t helper_be_stcondq_mmu(CPUArchState *env, target_ulong addr,
+                            uint64_t val, TCGMemOpIdx oi, uintptr_t retaddr);
+
 
 /* Temporary aliases until backends are converted.  */
 #ifdef TARGET_WORDS_BIGENDIAN
-- 
2.6.4


* [Qemu-devel] [RFC v6 05/14] tcg: Create new runtime helpers for excl accesses
  2015-12-14  8:41 [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation Alvise Rigo
                   ` (3 preceding siblings ...)
  2015-12-14  8:41 ` [Qemu-devel] [RFC v6 04/14] softmmu: Add helpers for a new slowpath Alvise Rigo
@ 2015-12-14  8:41 ` Alvise Rigo
  2015-12-14  9:40   ` Paolo Bonzini
  2015-12-14  8:41 ` [Qemu-devel] [RFC v6 06/14] configure: Use slow-path for atomic only when the softmmu is enabled Alvise Rigo
                   ` (12 subsequent siblings)
  17 siblings, 1 reply; 60+ messages in thread
From: Alvise Rigo @ 2015-12-14  8:41 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: claudio.fontana, pbonzini, jani.kokkonen, tech, alex.bennee, rth

Introduce a set of new runtime helpers to handle exclusive instructions.
These helpers are used as hooks to call the respective LL/SC helpers in
softmmu_llsc_template.h from TCG code.

Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
 Makefile.target             |   2 +-
 include/exec/helper-gen.h   |   1 +
 include/exec/helper-proto.h |   1 +
 include/exec/helper-tcg.h   |   1 +
 tcg-llsc-helper.c           | 109 ++++++++++++++++++++++++++++++++++++++++++++
 tcg-llsc-helper.h           |  35 ++++++++++++++
 tcg/tcg-llsc-gen-helper.h   |  32 +++++++++++++
 7 files changed, 180 insertions(+), 1 deletion(-)
 create mode 100644 tcg-llsc-helper.c
 create mode 100644 tcg-llsc-helper.h
 create mode 100644 tcg/tcg-llsc-gen-helper.h

diff --git a/Makefile.target b/Makefile.target
index 962d004..e9fbcdc 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -135,7 +135,7 @@ obj-y += arch_init.o cpus.o monitor.o gdbstub.o balloon.o ioport.o numa.o
 obj-y += qtest.o bootdevice.o
 obj-y += hw/
 obj-$(CONFIG_KVM) += kvm-all.o
-obj-y += memory.o cputlb.o
+obj-y += memory.o cputlb.o tcg-llsc-helper.o
 obj-y += memory_mapping.o
 obj-y += dump.o
 obj-y += migration/ram.o migration/savevm.o
diff --git a/include/exec/helper-gen.h b/include/exec/helper-gen.h
index 0d0da3a..d45cdad2 100644
--- a/include/exec/helper-gen.h
+++ b/include/exec/helper-gen.h
@@ -60,6 +60,7 @@ static inline void glue(gen_helper_, name)(dh_retvar_decl(ret)          \
 #include "trace/generated-helpers.h"
 #include "trace/generated-helpers-wrappers.h"
 #include "tcg-runtime.h"
+#include "tcg-llsc-gen-helper.h"
 
 #undef DEF_HELPER_FLAGS_0
 #undef DEF_HELPER_FLAGS_1
diff --git a/include/exec/helper-proto.h b/include/exec/helper-proto.h
index effdd43..90be2fd 100644
--- a/include/exec/helper-proto.h
+++ b/include/exec/helper-proto.h
@@ -29,6 +29,7 @@ dh_ctype(ret) HELPER(name) (dh_ctype(t1), dh_ctype(t2), dh_ctype(t3), \
 #include "helper.h"
 #include "trace/generated-helpers.h"
 #include "tcg-runtime.h"
+#include "tcg/tcg-llsc-gen-helper.h"
 
 #undef DEF_HELPER_FLAGS_0
 #undef DEF_HELPER_FLAGS_1
diff --git a/include/exec/helper-tcg.h b/include/exec/helper-tcg.h
index 79fa3c8..0c0dc71 100644
--- a/include/exec/helper-tcg.h
+++ b/include/exec/helper-tcg.h
@@ -38,6 +38,7 @@
 #include "helper.h"
 #include "trace/generated-helpers.h"
 #include "tcg-runtime.h"
+#include "tcg-llsc-gen-helper.h"
 
 #undef DEF_HELPER_FLAGS_0
 #undef DEF_HELPER_FLAGS_1
diff --git a/tcg-llsc-helper.c b/tcg-llsc-helper.c
new file mode 100644
index 0000000..04cc8b1
--- /dev/null
+++ b/tcg-llsc-helper.c
@@ -0,0 +1,109 @@
+/*
+ * Runtime helpers for atomic instruction emulation
+ *
+ * Copyright (c) 2015 Virtual Open Systems
+ *
+ * Authors:
+ *  Alvise Rigo <a.rigo@virtualopensystems.com>
+ *
+ * This library is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU Lesser General Public
+ * License as published by the Free Software Foundation; either
+ * version 2 of the License, or (at your option) any later version.
+ *
+ * This library is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * Lesser General Public License for more details.
+ *
+ * You should have received a copy of the GNU Lesser General Public
+ * License along with this library; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "exec/cpu_ldst.h"
+#include "exec/helper-head.h"
+#include "tcg-llsc-helper.h"
+
+#define LDEX_HELPER(SUFF, OPC, FUNC)                                    \
+uint32_t HELPER(ldlink_aa32_i##SUFF)(CPUArchState *env, uint32_t addr,  \
+                                                       uint32_t index)  \
+{                                                                       \
+    CPUArchState *state = env;                                          \
+    TCGMemOpIdx op;                                                     \
+                                                                        \
+    op = make_memop_idx(OPC, index);                                    \
+                                                                        \
+    return (uint32_t)FUNC(state, addr, op, GETRA());                    \
+}
+
+LDEX_HELPER(8, MO_UB, helper_ret_ldlinkub_mmu)
+
+LDEX_HELPER(16_be, MO_BEUW, helper_be_ldlinkuw_mmu)
+LDEX_HELPER(32_be, MO_BEUL, helper_be_ldlinkul_mmu)
+
+uint64_t HELPER(ldlink_aa32_i64_be)(CPUArchState *env, uint32_t addr,
+                                    uint32_t index)
+{
+    CPUArchState *state = env;
+    TCGMemOpIdx op;
+
+    op = make_memop_idx(MO_BEQ, index);
+
+    return helper_be_ldlinkq_mmu(state, addr, op, GETRA());
+}
+
+LDEX_HELPER(16_le, MO_LEUW, helper_le_ldlinkuw_mmu)
+LDEX_HELPER(32_le, MO_LEUL, helper_le_ldlinkul_mmu)
+
+uint64_t HELPER(ldlink_aa32_i64_le)(CPUArchState *env, uint32_t addr,
+                                    uint32_t index)
+{
+    CPUArchState *state = env;
+    TCGMemOpIdx op;
+
+    op = make_memop_idx(MO_LEQ, index);
+
+    return helper_le_ldlinkq_mmu(state, addr, op, GETRA());
+}
+
+#define STEX_HELPER(SUFF, DATA_TYPE, OPC, FUNC)                         \
+uint32_t HELPER(stcond_aa32_i##SUFF)(CPUArchState *env, uint32_t addr,  \
+                                     uint32_t val, uint32_t index)      \
+{                                                                       \
+    CPUArchState *state = env;                                          \
+    TCGMemOpIdx op;                                                     \
+                                                                        \
+    op = make_memop_idx(OPC, index);                                    \
+                                                                        \
+    return (uint32_t)FUNC(state, addr, val, op, GETRA());               \
+}
+
+STEX_HELPER(8, uint8_t, MO_UB, helper_ret_stcondb_mmu)
+
+STEX_HELPER(16_be, uint16_t, MO_BEUW, helper_be_stcondw_mmu)
+STEX_HELPER(32_be, uint32_t, MO_BEUL, helper_be_stcondl_mmu)
+
+uint32_t HELPER(stcond_aa32_i64_be)(CPUArchState *env, uint32_t addr,
+                                    uint64_t val, uint32_t index)
+{
+    CPUArchState *state = env;
+    TCGMemOpIdx op;
+
+    op = make_memop_idx(MO_BEQ, index);
+
+    return (uint32_t)helper_be_stcondq_mmu(state, addr, val, op, GETRA());
+}
+
+STEX_HELPER(16_le, uint16_t, MO_LEUW, helper_le_stcondw_mmu)
+STEX_HELPER(32_le, uint32_t, MO_LEUL, helper_le_stcondl_mmu)
+
+uint32_t HELPER(stcond_aa32_i64_le)(CPUArchState *env, uint32_t addr,
+                                    uint64_t val, uint32_t index)
+{
+    CPUArchState *state = env;
+    TCGMemOpIdx op;
+
+    op = make_memop_idx(MO_LEQ, index);
+
+    return (uint32_t)helper_le_stcondq_mmu(state, addr, val, op, GETRA());
+}
diff --git a/tcg-llsc-helper.h b/tcg-llsc-helper.h
new file mode 100644
index 0000000..bbe42c3
--- /dev/null
+++ b/tcg-llsc-helper.h
@@ -0,0 +1,35 @@
+#ifndef HELPER_LLSC_HEAD_H
+#define HELPER_LLSC_HEAD_H 1
+
+uint32_t HELPER(ldlink_aa32_i8)(CPUArchState *env, uint32_t addr,
+                                uint32_t index);
+uint32_t HELPER(ldlink_aa32_i16_be)(CPUArchState *env, uint32_t addr,
+                                    uint32_t index);
+uint32_t HELPER(ldlink_aa32_i32_be)(CPUArchState *env, uint32_t addr,
+                                    uint32_t index);
+uint64_t HELPER(ldlink_aa32_i64_be)(CPUArchState *env, uint32_t addr,
+                                    uint32_t index);
+uint32_t HELPER(ldlink_aa32_i16_le)(CPUArchState *env, uint32_t addr,
+                                    uint32_t index);
+uint32_t HELPER(ldlink_aa32_i32_le)(CPUArchState *env, uint32_t addr,
+                                    uint32_t index);
+uint64_t HELPER(ldlink_aa32_i64_le)(CPUArchState *env, uint32_t addr,
+                                    uint32_t index);
+
+
+uint32_t HELPER(stcond_aa32_i8)(CPUArchState *env, uint32_t addr,
+                                uint32_t val, uint32_t index);
+uint32_t HELPER(stcond_aa32_i16_be)(CPUArchState *env, uint32_t addr,
+                                    uint32_t val, uint32_t index);
+uint32_t HELPER(stcond_aa32_i32_be)(CPUArchState *env, uint32_t addr,
+                                    uint32_t val, uint32_t index);
+uint32_t HELPER(stcond_aa32_i64_be)(CPUArchState *env, uint32_t addr,
+                                    uint64_t val, uint32_t index);
+uint32_t HELPER(stcond_aa32_i16_le)(CPUArchState *env, uint32_t addr,
+                                    uint32_t val, uint32_t index);
+uint32_t HELPER(stcond_aa32_i32_le)(CPUArchState *env, uint32_t addr,
+                                    uint32_t val, uint32_t index);
+uint32_t HELPER(stcond_aa32_i64_le)(CPUArchState *env, uint32_t addr,
+                                    uint64_t val, uint32_t index);
+
+#endif
diff --git a/tcg/tcg-llsc-gen-helper.h b/tcg/tcg-llsc-gen-helper.h
new file mode 100644
index 0000000..2b647cd
--- /dev/null
+++ b/tcg/tcg-llsc-gen-helper.h
@@ -0,0 +1,32 @@
+DEF_HELPER_3(ldlink_aa32_i8, i32, env, i32, i32)
+DEF_HELPER_3(ldlink_aa32_i16_be, i32, env, i32, i32)
+DEF_HELPER_3(ldlink_aa32_i32_be, i32, env, i32, i32)
+DEF_HELPER_3(ldlink_aa32_i64_be, i64, env, i32, i32)
+DEF_HELPER_3(ldlink_aa32_i16_le, i32, env, i32, i32)
+DEF_HELPER_3(ldlink_aa32_i32_le, i32, env, i32, i32)
+DEF_HELPER_3(ldlink_aa32_i64_le, i64, env, i32, i32)
+
+DEF_HELPER_4(stcond_aa32_i8, i32, env, i32, i32, i32)
+DEF_HELPER_4(stcond_aa32_i16_be, i32, env, i32, i32, i32)
+DEF_HELPER_4(stcond_aa32_i32_be, i32, env, i32, i32, i32)
+DEF_HELPER_4(stcond_aa32_i64_be, i32, env, i32, i64, i32)
+DEF_HELPER_4(stcond_aa32_i16_le, i32, env, i32, i32, i32)
+DEF_HELPER_4(stcond_aa32_i32_le, i32, env, i32, i32, i32)
+DEF_HELPER_4(stcond_aa32_i64_le, i32, env, i32, i64, i32)
+
+/* Convenient aliases */
+#ifdef TARGET_WORDS_BIGENDIAN
+#define gen_helper_stcond_aa32_i16 gen_helper_stcond_aa32_i16_be
+#define gen_helper_stcond_aa32_i32 gen_helper_stcond_aa32_i32_be
+#define gen_helper_stcond_aa32_i64 gen_helper_stcond_aa32_i64_be
+#define gen_helper_ldlink_aa32_i16 gen_helper_ldlink_aa32_i16_be
+#define gen_helper_ldlink_aa32_i32 gen_helper_ldlink_aa32_i32_be
+#define gen_helper_ldlink_aa32_i64 gen_helper_ldlink_aa32_i64_be
+#else
+#define gen_helper_stcond_aa32_i16 gen_helper_stcond_aa32_i16_le
+#define gen_helper_stcond_aa32_i32 gen_helper_stcond_aa32_i32_le
+#define gen_helper_stcond_aa32_i64 gen_helper_stcond_aa32_i64_le
+#define gen_helper_ldlink_aa32_i16 gen_helper_ldlink_aa32_i16_le
+#define gen_helper_ldlink_aa32_i32 gen_helper_ldlink_aa32_i32_le
+#define gen_helper_ldlink_aa32_i64 gen_helper_ldlink_aa32_i64_le
+#endif
-- 
2.6.4


* [Qemu-devel] [RFC v6 06/14] configure: Use slow-path for atomic only when the softmmu is enabled
  2015-12-14  8:41 [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation Alvise Rigo
                   ` (4 preceding siblings ...)
  2015-12-14  8:41 ` [Qemu-devel] [RFC v6 05/14] tcg: Create new runtime helpers for excl accesses Alvise Rigo
@ 2015-12-14  8:41 ` Alvise Rigo
  2015-12-14  9:38   ` Paolo Bonzini
  2015-12-14 10:14   ` Laurent Vivier
  2015-12-14  8:41 ` [Qemu-devel] [RFC v6 07/14] target-arm: translate: Use ld/st excl for atomic insns Alvise Rigo
                   ` (11 subsequent siblings)
  17 siblings, 2 replies; 60+ messages in thread
From: Alvise Rigo @ 2015-12-14  8:41 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: claudio.fontana, pbonzini, jani.kokkonen, tech, alex.bennee, rth

Use the new slow path for atomic instruction translation when the
softmmu is enabled.

Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
 configure | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/configure b/configure
index b9552fd..cc3891a 100755
--- a/configure
+++ b/configure
@@ -4794,6 +4794,7 @@ echo "Install blobs     $blobs"
 echo "KVM support       $kvm"
 echo "RDMA support      $rdma"
 echo "TCG interpreter   $tcg_interpreter"
+echo "use ld/st excl    $softmmu"
 echo "fdt support       $fdt"
 echo "preadv support    $preadv"
 echo "fdatasync         $fdatasync"
@@ -5186,6 +5187,9 @@ fi
 if test "$tcg_interpreter" = "yes" ; then
   echo "CONFIG_TCG_INTERPRETER=y" >> $config_host_mak
 fi
+if test "$softmmu" = "yes" ; then
+  echo "CONFIG_TCG_USE_LDST_EXCL=y" >> $config_host_mak
+fi
 if test "$fdatasync" = "yes" ; then
   echo "CONFIG_FDATASYNC=y" >> $config_host_mak
 fi
-- 
2.6.4


* [Qemu-devel] [RFC v6 07/14] target-arm: translate: Use ld/st excl for atomic insns
  2015-12-14  8:41 [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation Alvise Rigo
                   ` (5 preceding siblings ...)
  2015-12-14  8:41 ` [Qemu-devel] [RFC v6 06/14] configure: Use slow-path for atomic only when the softmmu is enabled Alvise Rigo
@ 2015-12-14  8:41 ` Alvise Rigo
  2016-01-06 17:11   ` Alex Bennée
  2015-12-14  8:41 ` [Qemu-devel] [RFC v6 08/14] target-arm: Add atomic_clear helper for CLREX insn Alvise Rigo
                   ` (10 subsequent siblings)
  17 siblings, 1 reply; 60+ messages in thread
From: Alvise Rigo @ 2015-12-14  8:41 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: claudio.fontana, pbonzini, jani.kokkonen, tech, alex.bennee, rth

Use the new LL/SC runtime helpers to handle the ARM atomic (exclusive)
instructions.

In general, the runtime helper
gen_helper_{ldlink,stcond}_aa32_i{8,16,32,64}() calls the corresponding
helper_{ret,le,be}_{ldlink,stcond}*_mmu() function implemented in
softmmu_llsc_template.h.

Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
 target-arm/translate.c | 101 +++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 97 insertions(+), 4 deletions(-)

diff --git a/target-arm/translate.c b/target-arm/translate.c
index 5d22879..e88d8a3 100644
--- a/target-arm/translate.c
+++ b/target-arm/translate.c
@@ -64,8 +64,10 @@ TCGv_ptr cpu_env;
 static TCGv_i64 cpu_V0, cpu_V1, cpu_M0;
 static TCGv_i32 cpu_R[16];
 TCGv_i32 cpu_CF, cpu_NF, cpu_VF, cpu_ZF;
+#ifndef CONFIG_TCG_USE_LDST_EXCL
 TCGv_i64 cpu_exclusive_addr;
 TCGv_i64 cpu_exclusive_val;
+#endif
 #ifdef CONFIG_USER_ONLY
 TCGv_i64 cpu_exclusive_test;
 TCGv_i32 cpu_exclusive_info;
@@ -98,10 +100,12 @@ void arm_translate_init(void)
     cpu_VF = tcg_global_mem_new_i32(TCG_AREG0, offsetof(CPUARMState, VF), "VF");
     cpu_ZF = tcg_global_mem_new_i32(TCG_AREG0, offsetof(CPUARMState, ZF), "ZF");
 
+#ifndef CONFIG_TCG_USE_LDST_EXCL
     cpu_exclusive_addr = tcg_global_mem_new_i64(TCG_AREG0,
         offsetof(CPUARMState, exclusive_addr), "exclusive_addr");
     cpu_exclusive_val = tcg_global_mem_new_i64(TCG_AREG0,
         offsetof(CPUARMState, exclusive_val), "exclusive_val");
+#endif
 #ifdef CONFIG_USER_ONLY
     cpu_exclusive_test = tcg_global_mem_new_i64(TCG_AREG0,
         offsetof(CPUARMState, exclusive_test), "exclusive_test");
@@ -7414,15 +7418,59 @@ static void gen_logicq_cc(TCGv_i32 lo, TCGv_i32 hi)
     tcg_gen_or_i32(cpu_ZF, lo, hi);
 }
 
-/* Load/Store exclusive instructions are implemented by remembering
+/* If the softmmu is enabled, the translation of Load/Store exclusive
+ * instructions will rely on the gen_helper_{ldlink,stcond} helpers,
+ * offloading most of the work to the softmmu_llsc_template.h functions.
+
+   Otherwise, these instructions are implemented by remembering
    the value/address loaded, and seeing if these are the same
    when the store is performed. This should be sufficient to implement
    the architecturally mandated semantics, and avoids having to monitor
    regular stores.
 
-   In system emulation mode only one CPU will be running at once, so
-   this sequence is effectively atomic.  In user emulation mode we
-   throw an exception and handle the atomic operation elsewhere.  */
+   In user emulation mode we throw an exception and handle the atomic
+   operation elsewhere.  */
+#ifdef CONFIG_TCG_USE_LDST_EXCL
+static void gen_load_exclusive(DisasContext *s, int rt, int rt2,
+                               TCGv_i32 addr, int size)
+{
+    TCGv_i32 tmp = tcg_temp_new_i32();
+    TCGv_i32 mem_idx = tcg_temp_new_i32();
+
+    tcg_gen_movi_i32(mem_idx, get_mem_index(s));
+
+    if (size != 3) {
+        switch (size) {
+        case 0:
+            gen_helper_ldlink_aa32_i8(tmp, cpu_env, addr, mem_idx);
+            break;
+        case 1:
+            gen_helper_ldlink_aa32_i16(tmp, cpu_env, addr, mem_idx);
+            break;
+        case 2:
+            gen_helper_ldlink_aa32_i32(tmp, cpu_env, addr, mem_idx);
+            break;
+        default:
+            abort();
+        }
+
+        store_reg(s, rt, tmp);
+    } else {
+        TCGv_i64 tmp64 = tcg_temp_new_i64();
+        TCGv_i32 tmph = tcg_temp_new_i32();
+
+        gen_helper_ldlink_aa32_i64(tmp64, cpu_env, addr, mem_idx);
+        tcg_gen_extr_i64_i32(tmp, tmph, tmp64);
+
+        store_reg(s, rt, tmp);
+        store_reg(s, rt2, tmph);
+
+        tcg_temp_free_i64(tmp64);
+    }
+
+    tcg_temp_free_i32(mem_idx);
+}
+#else
 static void gen_load_exclusive(DisasContext *s, int rt, int rt2,
                                TCGv_i32 addr, int size)
 {
@@ -7461,10 +7509,14 @@ static void gen_load_exclusive(DisasContext *s, int rt, int rt2,
     store_reg(s, rt, tmp);
     tcg_gen_extu_i32_i64(cpu_exclusive_addr, addr);
 }
+#endif
 
 static void gen_clrex(DisasContext *s)
 {
+#ifdef CONFIG_TCG_USE_LDST_EXCL
+#else
     tcg_gen_movi_i64(cpu_exclusive_addr, -1);
+#endif
 }
 
 #ifdef CONFIG_USER_ONLY
@@ -7476,6 +7528,47 @@ static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
                      size | (rd << 4) | (rt << 8) | (rt2 << 12));
     gen_exception_internal_insn(s, 4, EXCP_STREX);
 }
+#elif defined CONFIG_TCG_USE_LDST_EXCL
+static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
+                                TCGv_i32 addr, int size)
+{
+    TCGv_i32 tmp, mem_idx;
+
+    mem_idx = tcg_temp_new_i32();
+
+    tcg_gen_movi_i32(mem_idx, get_mem_index(s));
+    tmp = load_reg(s, rt);
+
+    if (size != 3) {
+        switch (size) {
+        case 0:
+            gen_helper_stcond_aa32_i8(cpu_R[rd], cpu_env, addr, tmp, mem_idx);
+            break;
+        case 1:
+            gen_helper_stcond_aa32_i16(cpu_R[rd], cpu_env, addr, tmp, mem_idx);
+            break;
+        case 2:
+            gen_helper_stcond_aa32_i32(cpu_R[rd], cpu_env, addr, tmp, mem_idx);
+            break;
+        default:
+            abort();
+        }
+    } else {
+        TCGv_i64 tmp64;
+        TCGv_i32 tmp2;
+
+        tmp64 = tcg_temp_new_i64();
+        tmp2 = load_reg(s, rt2);
+        tcg_gen_concat_i32_i64(tmp64, tmp, tmp2);
+        gen_helper_stcond_aa32_i64(cpu_R[rd], cpu_env, addr, tmp64, mem_idx);
+
+        tcg_temp_free_i32(tmp2);
+        tcg_temp_free_i64(tmp64);
+    }
+
+    tcg_temp_free_i32(tmp);
+    tcg_temp_free_i32(mem_idx);
+}
 #else
 static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
                                 TCGv_i32 addr, int size)
-- 
2.6.4


* [Qemu-devel] [RFC v6 08/14] target-arm: Add atomic_clear helper for CLREX insn
  2015-12-14  8:41 [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation Alvise Rigo
                   ` (6 preceding siblings ...)
  2015-12-14  8:41 ` [Qemu-devel] [RFC v6 07/14] target-arm: translate: Use ld/st excl for atomic insns Alvise Rigo
@ 2015-12-14  8:41 ` Alvise Rigo
  2016-01-06 17:13   ` Alex Bennée
  2015-12-14  8:41 ` [Qemu-devel] [RFC v6 09/14] softmmu: Add history of excl accesses Alvise Rigo
                   ` (9 subsequent siblings)
  17 siblings, 1 reply; 60+ messages in thread
From: Alvise Rigo @ 2015-12-14  8:41 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: claudio.fontana, pbonzini, jani.kokkonen, tech, alex.bennee, rth

Add a simple helper function to emulate the CLREX instruction.

Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
 target-arm/helper.h    | 2 ++
 target-arm/op_helper.c | 6 ++++++
 target-arm/translate.c | 1 +
 3 files changed, 9 insertions(+)

diff --git a/target-arm/helper.h b/target-arm/helper.h
index c2a85c7..37cec49 100644
--- a/target-arm/helper.h
+++ b/target-arm/helper.h
@@ -532,6 +532,8 @@ DEF_HELPER_2(dc_zva, void, env, i64)
 DEF_HELPER_FLAGS_2(neon_pmull_64_lo, TCG_CALL_NO_RWG_SE, i64, i64, i64)
 DEF_HELPER_FLAGS_2(neon_pmull_64_hi, TCG_CALL_NO_RWG_SE, i64, i64, i64)
 
+DEF_HELPER_1(atomic_clear, void, env)
+
 #ifdef TARGET_AARCH64
 #include "helper-a64.h"
 #endif
diff --git a/target-arm/op_helper.c b/target-arm/op_helper.c
index 6cd54c8..5a67557 100644
--- a/target-arm/op_helper.c
+++ b/target-arm/op_helper.c
@@ -50,6 +50,12 @@ static int exception_target_el(CPUARMState *env)
     return target_el;
 }
 
+void HELPER(atomic_clear)(CPUARMState *env)
+{
+    ENV_GET_CPU(env)->excl_protected_range.begin = -1;
+    ENV_GET_CPU(env)->ll_sc_context = false;
+}
+
 uint32_t HELPER(neon_tbl)(CPUARMState *env, uint32_t ireg, uint32_t def,
                           uint32_t rn, uint32_t maxindex)
 {
diff --git a/target-arm/translate.c b/target-arm/translate.c
index e88d8a3..e0362e0 100644
--- a/target-arm/translate.c
+++ b/target-arm/translate.c
@@ -7514,6 +7514,7 @@ static void gen_load_exclusive(DisasContext *s, int rt, int rt2,
 static void gen_clrex(DisasContext *s)
 {
 #ifdef CONFIG_TCG_USE_LDST_EXCL
+    gen_helper_atomic_clear(cpu_env);
 #else
     tcg_gen_movi_i64(cpu_exclusive_addr, -1);
 #endif
-- 
2.6.4

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [Qemu-devel] [RFC v6 09/14] softmmu: Add history of excl accesses
  2015-12-14  8:41 [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation Alvise Rigo
                   ` (7 preceding siblings ...)
  2015-12-14  8:41 ` [Qemu-devel] [RFC v6 08/14] target-arm: Add atomic_clear helper for CLREX insn Alvise Rigo
@ 2015-12-14  8:41 ` Alvise Rigo
  2015-12-14  9:35   ` Paolo Bonzini
  2015-12-14  8:41 ` [Qemu-devel] [RFC v6 10/14] softmmu: Simplify helper_*_st_name, wrap unaligned code Alvise Rigo
                   ` (8 subsequent siblings)
  17 siblings, 1 reply; 60+ messages in thread
From: Alvise Rigo @ 2015-12-14  8:41 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: claudio.fontana, pbonzini, jani.kokkonen, tech, alex.bennee, rth

Add a circular buffer to store the hw addresses used in the last
EXCLUSIVE_HISTORY_LEN exclusive accesses.

When an address is popped from the buffer, its page is marked as no
longer exclusive. In this way, we avoid:
- frequently setting/unsetting a page's EXCL bit (which also causes
  frequent flushes)
- forgetting to clear the EXCL bit, which would otherwise stay set
  indefinitely.

Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
 cputlb.c                | 32 ++++++++++++++++++++++----------
 include/qom/cpu.h       |  3 +++
 softmmu_llsc_template.h |  2 ++
 3 files changed, 27 insertions(+), 10 deletions(-)

diff --git a/cputlb.c b/cputlb.c
index 70b6404..372877e 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -394,16 +394,6 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr,
     env->tlb_v_table[mmu_idx][vidx] = *te;
     env->iotlb_v[mmu_idx][vidx] = env->iotlb[mmu_idx][index];
 
-    if (unlikely(!(te->addr_write & TLB_MMIO) && (te->addr_write & TLB_EXCL))) {
-        /* We are removing an exclusive entry, set the page to dirty. This
-         * is not be necessary if the vCPU has performed both SC and LL. */
-        hwaddr hw_addr = (env->iotlb[mmu_idx][index].addr & TARGET_PAGE_MASK) +
-                                          (te->addr_write & TARGET_PAGE_MASK);
-        if (!cpu->ll_sc_context) {
-            cpu_physical_memory_unset_excl(hw_addr, cpu->cpu_index);
-        }
-    }
-
     /* refill the tlb */
     env->iotlb[mmu_idx][index].addr = iotlb - vaddr;
     env->iotlb[mmu_idx][index].attrs = attrs;
@@ -507,6 +497,28 @@ static inline void lookup_and_reset_cpus_ll_addr(hwaddr addr, hwaddr size)
     }
 }
 
+static inline void excl_history_put_addr(CPUState *cpu, hwaddr addr)
+{
+    /* Avoid some overhead if the address we are about to put is equal to
+     * the last one */
+    if (cpu->excl_protected_addr[cpu->excl_protected_last] !=
+                                    (addr & TARGET_PAGE_MASK)) {
+        cpu->excl_protected_last = (cpu->excl_protected_last + 1) %
+                                            EXCLUSIVE_HISTORY_LEN;
+        /* Unset EXCL bit of the oldest entry */
+        if (cpu->excl_protected_addr[cpu->excl_protected_last] !=
+                                            EXCLUSIVE_RESET_ADDR) {
+            cpu_physical_memory_unset_excl(
+                cpu->excl_protected_addr[cpu->excl_protected_last],
+                cpu->cpu_index);
+        }
+
+        /* Add a new address, overwriting the oldest one */
+        cpu->excl_protected_addr[cpu->excl_protected_last] =
+                                            addr & TARGET_PAGE_MASK;
+    }
+}
+
 #define MMUSUFFIX _mmu
 
 /* Generates LoadLink/StoreConditional helpers in softmmu_template.h */
diff --git a/include/qom/cpu.h b/include/qom/cpu.h
index 9e409ce..5f65ebf 100644
--- a/include/qom/cpu.h
+++ b/include/qom/cpu.h
@@ -217,6 +217,7 @@ struct kvm_run;
 
 /* Atomic insn translation TLB support. */
 #define EXCLUSIVE_RESET_ADDR ULLONG_MAX
+#define EXCLUSIVE_HISTORY_LEN 8
 
 /**
  * CPUState:
@@ -343,6 +344,8 @@ struct CPUState {
      * The address is set to EXCLUSIVE_RESET_ADDR if the vCPU is not.
      * in the middle of a LL/SC. */
     struct Range excl_protected_range;
+    hwaddr excl_protected_addr[EXCLUSIVE_HISTORY_LEN];
+    int excl_protected_last;
     /* Used to carry the SC result but also to flag a normal (legacy)
      * store access made by a stcond (see softmmu_template.h). */
     int excl_succeeded;
diff --git a/softmmu_llsc_template.h b/softmmu_llsc_template.h
index 586bb2e..becb90b 100644
--- a/softmmu_llsc_template.h
+++ b/softmmu_llsc_template.h
@@ -72,6 +72,7 @@ WORD_TYPE helper_ldlink_name(CPUArchState *env, target_ulong addr,
     hw_addr = (env->iotlb[mmu_idx][index].addr & TARGET_PAGE_MASK) + addr;
 
     cpu_physical_memory_set_excl(hw_addr, this->cpu_index);
+    excl_history_put_addr(this, hw_addr);
     /* If all the vCPUs have the EXCL bit set for this page there is no need
      * to request any flush. */
     if (cpu_physical_memory_not_excl(hw_addr, smp_cpus)) {
@@ -80,6 +81,7 @@ WORD_TYPE helper_ldlink_name(CPUArchState *env, target_ulong addr,
                 if (cpu_physical_memory_not_excl(hw_addr, cpu->cpu_index)) {
                     cpu_physical_memory_set_excl(hw_addr, cpu->cpu_index);
                     tlb_flush(cpu, 1);
+                    excl_history_put_addr(cpu, hw_addr);
                 }
             }
         }
-- 
2.6.4

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [Qemu-devel]  [RFC v6 10/14] softmmu: Simplify helper_*_st_name, wrap unaligned code
  2015-12-14  8:41 [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation Alvise Rigo
                   ` (8 preceding siblings ...)
  2015-12-14  8:41 ` [Qemu-devel] [RFC v6 09/14] softmmu: Add history of excl accesses Alvise Rigo
@ 2015-12-14  8:41 ` Alvise Rigo
  2016-01-07 14:46   ` Alex Bennée
  2016-01-08 11:19   ` Alex Bennée
  2015-12-14  8:41 ` [Qemu-devel] [RFC v6 11/14] softmmu: Simplify helper_*_st_name, wrap MMIO code Alvise Rigo
                   ` (7 subsequent siblings)
  17 siblings, 2 replies; 60+ messages in thread
From: Alvise Rigo @ 2015-12-14  8:41 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: claudio.fontana, pbonzini, jani.kokkonen, tech, alex.bennee, rth

In an attempt to simplify the helper_*_st_name helpers, wrap the
do_unaligned_access code in an inline function.
Also remove the goto statement.

Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
 softmmu_template.h | 96 ++++++++++++++++++++++++++++++++++--------------------
 1 file changed, 60 insertions(+), 36 deletions(-)

diff --git a/softmmu_template.h b/softmmu_template.h
index d3d5902..92f92b1 100644
--- a/softmmu_template.h
+++ b/softmmu_template.h
@@ -370,6 +370,32 @@ static inline void glue(io_write, SUFFIX)(CPUArchState *env,
                                  iotlbentry->attrs);
 }
 
+static inline void glue(helper_le_st_name, _do_unl_access)(CPUArchState *env,
+                                                           DATA_TYPE val,
+                                                           target_ulong addr,
+                                                           TCGMemOpIdx oi,
+                                                           unsigned mmu_idx,
+                                                           uintptr_t retaddr)
+{
+    int i;
+
+    if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
+        cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
+                             mmu_idx, retaddr);
+    }
+    /* XXX: not efficient, but simple */
+    /* Note: relies on the fact that tlb_fill() does not remove the
+     * previous page from the TLB cache.  */
+    for (i = DATA_SIZE - 1; i >= 0; i--) {
+        /* Little-endian extract.  */
+        uint8_t val8 = val >> (i * 8);
+        /* Note the adjustment at the beginning of the function.
+           Undo that for the recursion.  */
+        glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
+                                        oi, retaddr + GETPC_ADJ);
+    }
+}
+
 void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
                        TCGMemOpIdx oi, uintptr_t retaddr)
 {
@@ -433,7 +459,8 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
             return;
         } else {
             if ((addr & (DATA_SIZE - 1)) != 0) {
-                goto do_unaligned_access;
+                glue(helper_le_st_name, _do_unl_access)(env, val, addr, mmu_idx,
+                                                        oi, retaddr);
             }
             iotlbentry = &env->iotlb[mmu_idx][index];
 
@@ -449,23 +476,8 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
     if (DATA_SIZE > 1
         && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
                      >= TARGET_PAGE_SIZE)) {
-        int i;
-    do_unaligned_access:
-        if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
-            cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
-                                 mmu_idx, retaddr);
-        }
-        /* XXX: not efficient, but simple */
-        /* Note: relies on the fact that tlb_fill() does not remove the
-         * previous page from the TLB cache.  */
-        for (i = DATA_SIZE - 1; i >= 0; i--) {
-            /* Little-endian extract.  */
-            uint8_t val8 = val >> (i * 8);
-            /* Note the adjustment at the beginning of the function.
-               Undo that for the recursion.  */
-            glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
-                                            oi, retaddr + GETPC_ADJ);
-        }
+        glue(helper_le_st_name, _do_unl_access)(env, val, addr, oi, mmu_idx,
+                                                retaddr);
         return;
     }
 
@@ -485,6 +497,32 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
 }
 
 #if DATA_SIZE > 1
+static inline void glue(helper_be_st_name, _do_unl_access)(CPUArchState *env,
+                                                           DATA_TYPE val,
+                                                           target_ulong addr,
+                                                           TCGMemOpIdx oi,
+                                                           unsigned mmu_idx,
+                                                           uintptr_t retaddr)
+{
+    int i;
+
+    if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
+        cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
+                             mmu_idx, retaddr);
+    }
+    /* XXX: not efficient, but simple */
+    /* Note: relies on the fact that tlb_fill() does not remove the
+     * previous page from the TLB cache.  */
+    for (i = DATA_SIZE - 1; i >= 0; i--) {
+        /* Big-endian extract.  */
+        uint8_t val8 = val >> (((DATA_SIZE - 1) * 8) - (i * 8));
+        /* Note the adjustment at the beginning of the function.
+           Undo that for the recursion.  */
+        glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
+                                        oi, retaddr + GETPC_ADJ);
+    }
+}
+
 void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
                        TCGMemOpIdx oi, uintptr_t retaddr)
 {
@@ -548,7 +586,8 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
             return;
         } else {
             if ((addr & (DATA_SIZE - 1)) != 0) {
-                goto do_unaligned_access;
+                glue(helper_be_st_name, _do_unl_access)(env, val, addr, mmu_idx,
+                                                        oi, retaddr);
             }
             iotlbentry = &env->iotlb[mmu_idx][index];
 
@@ -564,23 +603,8 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
     if (DATA_SIZE > 1
         && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
                      >= TARGET_PAGE_SIZE)) {
-        int i;
-    do_unaligned_access:
-        if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
-            cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
-                                 mmu_idx, retaddr);
-        }
-        /* XXX: not efficient, but simple */
-        /* Note: relies on the fact that tlb_fill() does not remove the
-         * previous page from the TLB cache.  */
-        for (i = DATA_SIZE - 1; i >= 0; i--) {
-            /* Big-endian extract.  */
-            uint8_t val8 = val >> (((DATA_SIZE - 1) * 8) - (i * 8));
-            /* Note the adjustment at the beginning of the function.
-               Undo that for the recursion.  */
-            glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
-                                            oi, retaddr + GETPC_ADJ);
-        }
+        glue(helper_be_st_name, _do_unl_access)(env, val, addr, oi, mmu_idx,
+                                                retaddr);
         return;
     }
 
-- 
2.6.4

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [Qemu-devel]  [RFC v6 11/14] softmmu: Simplify helper_*_st_name, wrap MMIO code
  2015-12-14  8:41 [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation Alvise Rigo
                   ` (9 preceding siblings ...)
  2015-12-14  8:41 ` [Qemu-devel] [RFC v6 10/14] softmmu: Simplify helper_*_st_name, wrap unaligned code Alvise Rigo
@ 2015-12-14  8:41 ` Alvise Rigo
  2016-01-11  9:54   ` Alex Bennée
  2015-12-14  8:41 ` [Qemu-devel] [RFC v6 12/14] softmmu: Simplify helper_*_st_name, wrap RAM code Alvise Rigo
                   ` (6 subsequent siblings)
  17 siblings, 1 reply; 60+ messages in thread
From: Alvise Rigo @ 2015-12-14  8:41 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: claudio.fontana, pbonzini, jani.kokkonen, tech, alex.bennee, rth

In an attempt to simplify the helper_*_st_name helpers, wrap the MMIO
code in an inline function.

Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
 softmmu_template.h | 64 +++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 44 insertions(+), 20 deletions(-)

diff --git a/softmmu_template.h b/softmmu_template.h
index 92f92b1..2ebf527 100644
--- a/softmmu_template.h
+++ b/softmmu_template.h
@@ -396,6 +396,26 @@ static inline void glue(helper_le_st_name, _do_unl_access)(CPUArchState *env,
     }
 }
 
+static inline void glue(helper_le_st_name, _do_mmio_access)(CPUArchState *env,
+                                                            DATA_TYPE val,
+                                                            target_ulong addr,
+                                                            TCGMemOpIdx oi,
+                                                            unsigned mmu_idx,
+                                                            int index,
+                                                            uintptr_t retaddr)
+{
+    CPUIOTLBEntry *iotlbentry = &env->iotlb[mmu_idx][index];
+
+    if ((addr & (DATA_SIZE - 1)) != 0) {
+        glue(helper_le_st_name, _do_unl_access)(env, val, addr, mmu_idx,
+                                                oi, retaddr);
+    }
+    /* ??? Note that the io helpers always read data in the target
+       byte ordering.  We should push the LE/BE request down into io.  */
+    val = TGT_LE(val);
+    glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
+}
+
 void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
                        TCGMemOpIdx oi, uintptr_t retaddr)
 {
@@ -458,16 +478,8 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
 
             return;
         } else {
-            if ((addr & (DATA_SIZE - 1)) != 0) {
-                glue(helper_le_st_name, _do_unl_access)(env, val, addr, mmu_idx,
-                                                        oi, retaddr);
-            }
-            iotlbentry = &env->iotlb[mmu_idx][index];
-
-            /* ??? Note that the io helpers always read data in the target
-               byte ordering.  We should push the LE/BE request down into io.  */
-            val = TGT_LE(val);
-            glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
+            glue(helper_le_st_name, _do_mmio_access)(env, val, addr, oi,
+                                                     mmu_idx, index, retaddr);
             return;
         }
     }
@@ -523,6 +535,26 @@ static inline void glue(helper_be_st_name, _do_unl_access)(CPUArchState *env,
     }
 }
 
+static inline void glue(helper_be_st_name, _do_mmio_access)(CPUArchState *env,
+                                                            DATA_TYPE val,
+                                                            target_ulong addr,
+                                                            TCGMemOpIdx oi,
+                                                            unsigned mmu_idx,
+                                                            int index,
+                                                            uintptr_t retaddr)
+{
+    CPUIOTLBEntry *iotlbentry = &env->iotlb[mmu_idx][index];
+
+    if ((addr & (DATA_SIZE - 1)) != 0) {
+        glue(helper_be_st_name, _do_unl_access)(env, val, addr, mmu_idx,
+                                                oi, retaddr);
+    }
+    /* ??? Note that the io helpers always read data in the target
+       byte ordering.  We should push the LE/BE request down into io.  */
+    val = TGT_BE(val);
+    glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
+}
+
 void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
                        TCGMemOpIdx oi, uintptr_t retaddr)
 {
@@ -585,16 +617,8 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
 
             return;
         } else {
-            if ((addr & (DATA_SIZE - 1)) != 0) {
-                glue(helper_be_st_name, _do_unl_access)(env, val, addr, mmu_idx,
-                                                        oi, retaddr);
-            }
-            iotlbentry = &env->iotlb[mmu_idx][index];
-
-            /* ??? Note that the io helpers always read data in the target
-               byte ordering.  We should push the LE/BE request down into io.  */
-            val = TGT_BE(val);
-            glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
+            glue(helper_be_st_name, _do_mmio_access)(env, val, addr, oi,
+                                                     mmu_idx, index, retaddr);
             return;
         }
     }
-- 
2.6.4

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [Qemu-devel]  [RFC v6 12/14] softmmu: Simplify helper_*_st_name, wrap RAM code
  2015-12-14  8:41 [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation Alvise Rigo
                   ` (10 preceding siblings ...)
  2015-12-14  8:41 ` [Qemu-devel] [RFC v6 11/14] softmmu: Simplify helper_*_st_name, wrap MMIO code Alvise Rigo
@ 2015-12-14  8:41 ` Alvise Rigo
  2015-12-17 16:52   ` Alex Bennée
  2015-12-14  8:41 ` [Qemu-devel] [RFC v6 13/14] softmmu: Include MMIO/invalid exclusive accesses Alvise Rigo
                   ` (5 subsequent siblings)
  17 siblings, 1 reply; 60+ messages in thread
From: Alvise Rigo @ 2015-12-14  8:41 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: claudio.fontana, pbonzini, jani.kokkonen, tech, alex.bennee, rth

In an attempt to simplify the helper_*_st_name helpers, wrap the code
handling a RAM access in an inline function.

Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
 softmmu_template.h | 110 +++++++++++++++++++++++++++++++++--------------------
 1 file changed, 68 insertions(+), 42 deletions(-)

diff --git a/softmmu_template.h b/softmmu_template.h
index 2ebf527..262c95f 100644
--- a/softmmu_template.h
+++ b/softmmu_template.h
@@ -416,13 +416,46 @@ static inline void glue(helper_le_st_name, _do_mmio_access)(CPUArchState *env,
     glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
 }
 
+static inline void glue(helper_le_st_name, _do_ram_access)(CPUArchState *env,
+                                                           DATA_TYPE val,
+                                                           target_ulong addr,
+                                                           TCGMemOpIdx oi,
+                                                           unsigned mmu_idx,
+                                                           int index,
+                                                           uintptr_t retaddr)
+{
+    uintptr_t haddr;
+
+    /* Handle slow unaligned access (it spans two pages or IO).  */
+    if (DATA_SIZE > 1
+        && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
+                     >= TARGET_PAGE_SIZE)) {
+        glue(helper_le_st_name, _do_unl_access)(env, val, addr, oi, mmu_idx,
+                                                retaddr);
+        return;
+    }
+
+    /* Handle aligned access or unaligned access in the same page.  */
+    if ((addr & (DATA_SIZE - 1)) != 0
+        && (get_memop(oi) & MO_AMASK) == MO_ALIGN) {
+        cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
+                             mmu_idx, retaddr);
+    }
+
+    haddr = addr + env->tlb_table[mmu_idx][index].addend;
+#if DATA_SIZE == 1
+    glue(glue(st, SUFFIX), _p)((uint8_t *)haddr, val);
+#else
+    glue(glue(st, SUFFIX), _le_p)((uint8_t *)haddr, val);
+#endif
+}
+
 void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
                        TCGMemOpIdx oi, uintptr_t retaddr)
 {
     unsigned mmu_idx = get_mmuidx(oi);
     int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
     target_ulong tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
-    uintptr_t haddr;
 
     /* Adjust the given return address.  */
     retaddr -= GETPC_ADJ;
@@ -484,28 +517,8 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
         }
     }
 
-    /* Handle slow unaligned access (it spans two pages or IO).  */
-    if (DATA_SIZE > 1
-        && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
-                     >= TARGET_PAGE_SIZE)) {
-        glue(helper_le_st_name, _do_unl_access)(env, val, addr, oi, mmu_idx,
-                                                retaddr);
-        return;
-    }
-
-    /* Handle aligned access or unaligned access in the same page.  */
-    if ((addr & (DATA_SIZE - 1)) != 0
-        && (get_memop(oi) & MO_AMASK) == MO_ALIGN) {
-        cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
-                             mmu_idx, retaddr);
-    }
-
-    haddr = addr + env->tlb_table[mmu_idx][index].addend;
-#if DATA_SIZE == 1
-    glue(glue(st, SUFFIX), _p)((uint8_t *)haddr, val);
-#else
-    glue(glue(st, SUFFIX), _le_p)((uint8_t *)haddr, val);
-#endif
+    glue(helper_le_st_name, _do_ram_access)(env, val, addr, oi, mmu_idx, index,
+                                            retaddr);
 }
 
 #if DATA_SIZE > 1
@@ -555,13 +568,42 @@ static inline void glue(helper_be_st_name, _do_mmio_access)(CPUArchState *env,
     glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
 }
 
+static inline void glue(helper_be_st_name, _do_ram_access)(CPUArchState *env,
+                                                           DATA_TYPE val,
+                                                           target_ulong addr,
+                                                           TCGMemOpIdx oi,
+                                                           unsigned mmu_idx,
+                                                           int index,
+                                                           uintptr_t retaddr)
+{
+    uintptr_t haddr;
+
+    /* Handle slow unaligned access (it spans two pages or IO).  */
+    if (DATA_SIZE > 1
+        && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
+                     >= TARGET_PAGE_SIZE)) {
+        glue(helper_be_st_name, _do_unl_access)(env, val, addr, oi, mmu_idx,
+                                                retaddr);
+        return;
+    }
+
+    /* Handle aligned access or unaligned access in the same page.  */
+    if ((addr & (DATA_SIZE - 1)) != 0
+        && (get_memop(oi) & MO_AMASK) == MO_ALIGN) {
+        cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
+                             mmu_idx, retaddr);
+    }
+
+    haddr = addr + env->tlb_table[mmu_idx][index].addend;
+    glue(glue(st, SUFFIX), _be_p)((uint8_t *)haddr, val);
+}
+
 void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
                        TCGMemOpIdx oi, uintptr_t retaddr)
 {
     unsigned mmu_idx = get_mmuidx(oi);
     int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
     target_ulong tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
-    uintptr_t haddr;
 
     /* Adjust the given return address.  */
     retaddr -= GETPC_ADJ;
@@ -623,24 +665,8 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
         }
     }
 
-    /* Handle slow unaligned access (it spans two pages or IO).  */
-    if (DATA_SIZE > 1
-        && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
-                     >= TARGET_PAGE_SIZE)) {
-        glue(helper_be_st_name, _do_unl_access)(env, val, addr, oi, mmu_idx,
-                                                retaddr);
-        return;
-    }
-
-    /* Handle aligned access or unaligned access in the same page.  */
-    if ((addr & (DATA_SIZE - 1)) != 0
-        && (get_memop(oi) & MO_AMASK) == MO_ALIGN) {
-        cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
-                             mmu_idx, retaddr);
-    }
-
-    haddr = addr + env->tlb_table[mmu_idx][index].addend;
-    glue(glue(st, SUFFIX), _be_p)((uint8_t *)haddr, val);
+    glue(helper_be_st_name, _do_ram_access)(env, val, addr, oi, mmu_idx, index,
+                                            retaddr);
 }
 #endif /* DATA_SIZE > 1 */
 
-- 
2.6.4

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [Qemu-devel] [RFC v6 13/14] softmmu: Include MMIO/invalid exclusive accesses
  2015-12-14  8:41 [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation Alvise Rigo
                   ` (11 preceding siblings ...)
  2015-12-14  8:41 ` [Qemu-devel] [RFC v6 12/14] softmmu: Simplify helper_*_st_name, wrap RAM code Alvise Rigo
@ 2015-12-14  8:41 ` Alvise Rigo
  2015-12-14  8:41 ` [Qemu-devel] [RFC v6 14/14] softmmu: Protect MMIO exclusive range Alvise Rigo
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 60+ messages in thread
From: Alvise Rigo @ 2015-12-14  8:41 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: claudio.fontana, pbonzini, jani.kokkonen, tech, alex.bennee, rth

Enable exclusive accesses when the MMIO/invalid flag is set in the TLB
entry.
When an LL access is made to MMIO memory, we treat it differently from
a RAM access in that we do not rely on the EXCL bitmap to flag the page
as exclusive. In fact, we do not even need the TLB_EXCL flag to force the
slow path, since for MMIO it is always taken anyway.

This commit does not take care of invalidating an MMIO exclusive range on
other non-exclusive accesses, i.e. CPU1 performs a LoadLink to MMIO
address X while CPU2 writes to X. This will be addressed in the following
commit.

Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
 cputlb.c                | 20 +++++++++++---------
 softmmu_llsc_template.h | 25 ++++++++++++++-----------
 softmmu_template.h      | 38 ++++++++++++++++++++------------------
 3 files changed, 45 insertions(+), 38 deletions(-)

diff --git a/cputlb.c b/cputlb.c
index 372877e..7c2669c 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -413,22 +413,24 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr,
         if ((memory_region_is_ram(section->mr) && section->readonly)
             || memory_region_is_romd(section->mr)) {
             /* Write access calls the I/O callback.  */
-            te->addr_write = address | TLB_MMIO;
+            address |= TLB_MMIO;
         } else if (memory_region_is_ram(section->mr)
                    && cpu_physical_memory_is_clean(section->mr->ram_addr
                                                    + xlat)) {
-            te->addr_write = address | TLB_NOTDIRTY;
-        } else {
-            if (!(address & TLB_MMIO) &&
-                cpu_physical_memory_atleast_one_excl(section->mr->ram_addr
-                                                           + xlat)) {
+            address |= TLB_NOTDIRTY;
+        }
+
+        /* Since MMIO accesses always follow the slow path, we do not need
+         * to set any flag to trap the access */
+        if (!(address & TLB_MMIO)) {
+            if (cpu_physical_memory_atleast_one_excl(
+                                        section->mr->ram_addr + xlat)) {
                 /* There is at least one vCPU that has flagged the address as
                  * exclusive. */
-                te->addr_write = address | TLB_EXCL;
-            } else {
-                te->addr_write = address;
+                address |= TLB_EXCL;
             }
         }
+        te->addr_write = address;
     } else {
         te->addr_write = -1;
     }
diff --git a/softmmu_llsc_template.h b/softmmu_llsc_template.h
index becb90b..bbc820e 100644
--- a/softmmu_llsc_template.h
+++ b/softmmu_llsc_template.h
@@ -71,17 +71,20 @@ WORD_TYPE helper_ldlink_name(CPUArchState *env, target_ulong addr,
      * plus the offset (i.e. addr & ~TARGET_PAGE_MASK) */
     hw_addr = (env->iotlb[mmu_idx][index].addr & TARGET_PAGE_MASK) + addr;
 
-    cpu_physical_memory_set_excl(hw_addr, this->cpu_index);
-    excl_history_put_addr(this, hw_addr);
-    /* If all the vCPUs have the EXCL bit set for this page there is no need
-     * to request any flush. */
-    if (cpu_physical_memory_not_excl(hw_addr, smp_cpus)) {
-        CPU_FOREACH(cpu) {
-            if (current_cpu != cpu) {
-                if (cpu_physical_memory_not_excl(hw_addr, cpu->cpu_index)) {
-                    cpu_physical_memory_set_excl(hw_addr, cpu->cpu_index);
-                    tlb_flush(cpu, 1);
-                    excl_history_put_addr(cpu, hw_addr);
+    /* No need to flush for MMIO addresses, the slow path is always used */
+    if (likely(!(env->tlb_table[mmu_idx][index].addr_read & TLB_MMIO))) {
+        cpu_physical_memory_set_excl(hw_addr, this->cpu_index);
+        excl_history_put_addr(this, hw_addr);
+        /* If all the vCPUs have the EXCL bit set for this page there is no need
+         * to request any flush. */
+        if (cpu_physical_memory_not_excl(hw_addr, smp_cpus)) {
+            CPU_FOREACH(cpu) {
+                if (current_cpu != cpu) {
+                    if (cpu_physical_memory_not_excl(hw_addr, cpu->cpu_index)) {
+                        cpu_physical_memory_set_excl(hw_addr, cpu->cpu_index);
+                        tlb_flush(cpu, 1);
+                        excl_history_put_addr(cpu, hw_addr);
+                    }
                 }
             }
         }
diff --git a/softmmu_template.h b/softmmu_template.h
index 262c95f..196beec 100644
--- a/softmmu_template.h
+++ b/softmmu_template.h
@@ -476,9 +476,8 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
 
     /* Handle an IO access or exclusive access.  */
     if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
-        CPUIOTLBEntry *iotlbentry = &env->iotlb[mmu_idx][index];
-
-        if ((tlb_addr & ~TARGET_PAGE_MASK) == TLB_EXCL) {
+        if (tlb_addr & TLB_EXCL) {
+            CPUIOTLBEntry *iotlbentry = &env->iotlb[mmu_idx][index];
             CPUState *cpu = ENV_GET_CPU(env);
             /* The slow-path has been forced since we are writing to
              * exclusive-protected memory. */
@@ -500,12 +499,14 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
                 cpu_physical_memory_unset_excl(hw_addr, cpu->cpu_index);
             }
 
-            haddr = addr + env->tlb_table[mmu_idx][index].addend;
-        #if DATA_SIZE == 1
-            glue(glue(st, SUFFIX), _p)((uint8_t *)haddr, val);
-        #else
-            glue(glue(st, SUFFIX), _le_p)((uint8_t *)haddr, val);
-        #endif
+            if (tlb_addr & ~(TARGET_PAGE_MASK | TLB_EXCL)) { /* MMIO access */
+                glue(helper_le_st_name, _do_mmio_access)(env, val, addr, oi,
+                                                         mmu_idx, index,
+                                                         retaddr);
+            } else {
+                glue(helper_le_st_name, _do_ram_access)(env, val, addr, oi,
+                                                        mmu_idx, index,retaddr);
+            }
 
             lookup_and_reset_cpus_ll_addr(hw_addr, DATA_SIZE);
 
@@ -624,9 +625,8 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
 
     /* Handle an IO access or exclusive access.  */
     if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
-        CPUIOTLBEntry *iotlbentry = &env->iotlb[mmu_idx][index];
-
-        if ((tlb_addr & ~TARGET_PAGE_MASK) == TLB_EXCL) {
+        if (tlb_addr & TLB_EXCL) {
+            CPUIOTLBEntry *iotlbentry = &env->iotlb[mmu_idx][index];
             CPUState *cpu = ENV_GET_CPU(env);
             /* The slow-path has been forced since we are writing to
              * exclusive-protected memory. */
@@ -648,12 +648,14 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
                 cpu_physical_memory_unset_excl(hw_addr, cpu->cpu_index);
             }
 
-            haddr = addr + env->tlb_table[mmu_idx][index].addend;
-        #if DATA_SIZE == 1
-            glue(glue(st, SUFFIX), _p)((uint8_t *)haddr, val);
-        #else
-            glue(glue(st, SUFFIX), _le_p)((uint8_t *)haddr, val);
-        #endif
+            if (tlb_addr & ~(TARGET_PAGE_MASK | TLB_EXCL)) { /* MMIO access */
+                glue(helper_be_st_name, _do_mmio_access)(env, val, addr, oi,
+                                                         mmu_idx, index,
+                                                         retaddr);
+            } else {
+                glue(helper_be_st_name, _do_ram_access)(env, val, addr, oi,
+                                                        mmu_idx, index,retaddr);
+            }
 
             lookup_and_reset_cpus_ll_addr(hw_addr, DATA_SIZE);
 
-- 
2.6.4

^ permalink raw reply	[flat|nested] 60+ messages in thread

* [Qemu-devel] [RFC v6 14/14] softmmu: Protect MMIO exclusive range
  2015-12-14  8:41 [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation Alvise Rigo
                   ` (12 preceding siblings ...)
  2015-12-14  8:41 ` [Qemu-devel] [RFC v6 13/14] softmmu: Include MMIO/invalid exclusive accesses Alvise Rigo
@ 2015-12-14  8:41 ` Alvise Rigo
  2015-12-14  9:33 ` [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation Paolo Bonzini
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 60+ messages in thread
From: Alvise Rigo @ 2015-12-14  8:41 UTC (permalink / raw)
  To: qemu-devel, mttcg
  Cc: claudio.fontana, pbonzini, jani.kokkonen, tech, alex.bennee, rth

As in the RAM case, the MMIO exclusive ranges also have to be protected
from other CPUs' accesses. In order to do that, we flag the accessed
MemoryRegion to mark that an exclusive access has been performed and has
not yet concluded.
This flag will force the other CPUs to invalidate the exclusive range in
case of collision.

Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
---
 cputlb.c                | 20 +++++++++++++-------
 include/exec/memory.h   |  1 +
 softmmu_llsc_template.h | 11 ++++++++---
 softmmu_template.h      | 22 ++++++++++++++++++++++
 4 files changed, 44 insertions(+), 10 deletions(-)

diff --git a/cputlb.c b/cputlb.c
index 7c2669c..7348c5f 100644
--- a/cputlb.c
+++ b/cputlb.c
@@ -484,19 +484,25 @@ tb_page_addr_t get_page_addr_code(CPUArchState *env1, target_ulong addr)
 /* For every vCPU compare the exclusive address and reset it in case of a
  * match. Since only one vCPU is running at once, no lock has to be held to
  * guard this operation. */
-static inline void lookup_and_reset_cpus_ll_addr(hwaddr addr, hwaddr size)
+static inline bool lookup_and_reset_cpus_ll_addr(hwaddr addr, hwaddr size)
 {
     CPUState *cpu;
+    bool ret = false;
 
     CPU_FOREACH(cpu) {
-        if (cpu->excl_protected_range.begin != EXCLUSIVE_RESET_ADDR &&
-            ranges_overlap(cpu->excl_protected_range.begin,
-                           cpu->excl_protected_range.end -
-                           cpu->excl_protected_range.begin,
-                           addr, size)) {
-            cpu->excl_protected_range.begin = EXCLUSIVE_RESET_ADDR;
+        if (current_cpu != cpu) {
+            if (cpu->excl_protected_range.begin != EXCLUSIVE_RESET_ADDR &&
+                ranges_overlap(cpu->excl_protected_range.begin,
+                               cpu->excl_protected_range.end -
+                               cpu->excl_protected_range.begin,
+                               addr, size)) {
+                cpu->excl_protected_range.begin = EXCLUSIVE_RESET_ADDR;
+                ret = true;
+            }
         }
     }
+
+    return ret;
 }
 
 static inline void excl_history_put_addr(CPUState *cpu, hwaddr addr)
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 2782c77..80961c2 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -181,6 +181,7 @@ struct MemoryRegion {
     bool warning_printed; /* For reservations */
     bool flush_coalesced_mmio;
     bool global_locking;
+    bool pending_excl_access; /* A vCPU issued an exclusive access */
     uint8_t vga_logging_count;
     MemoryRegion *alias;
     hwaddr alias_offset;
diff --git a/softmmu_llsc_template.h b/softmmu_llsc_template.h
index bbc820e..07e18ce 100644
--- a/softmmu_llsc_template.h
+++ b/softmmu_llsc_template.h
@@ -88,13 +88,17 @@ WORD_TYPE helper_ldlink_name(CPUArchState *env, target_ulong addr,
                 }
             }
         }
+        /* For this vCPU, just update the TLB entry, no need to flush. */
+        env->tlb_table[mmu_idx][index].addr_write |= TLB_EXCL;
+    } else {
+        /* Set a pending exclusive access in the MemoryRegion */
+        MemoryRegion *mr = iotlb_to_region(this,
+                                           env->iotlb[mmu_idx][index].addr);
+        mr->pending_excl_access = true;
     }
 
     cc->cpu_set_excl_protected_range(this, hw_addr, DATA_SIZE);
 
-    /* For this vCPU, just update the TLB entry, no need to flush. */
-    env->tlb_table[mmu_idx][index].addr_write |= TLB_EXCL;
-
     /* From now on we are in LL/SC context */
     this->ll_sc_context = 1;
 
@@ -128,6 +132,7 @@ WORD_TYPE helper_stcond_name(CPUArchState *env, target_ulong addr,
 
         /* Unset LL/SC context */
         cpu->ll_sc_context = 0;
+        cpu->excl_protected_range.begin = EXCLUSIVE_RESET_ADDR;
     }
 
     return ret;
diff --git a/softmmu_template.h b/softmmu_template.h
index 196beec..65cce0a 100644
--- a/softmmu_template.h
+++ b/softmmu_template.h
@@ -360,6 +360,14 @@ static inline void glue(io_write, SUFFIX)(CPUArchState *env,
     MemoryRegion *mr = iotlb_to_region(cpu, physaddr);
 
     physaddr = (physaddr & TARGET_PAGE_MASK) + addr;
+
+    /* Invalidate the exclusive range that overlaps this access */
+    if (mr->pending_excl_access) {
+        if (lookup_and_reset_cpus_ll_addr(physaddr, 1 << SHIFT)) {
+            mr->pending_excl_access = false;
+        }
+    }
+
     if (mr != &io_mem_rom && mr != &io_mem_notdirty && !cpu->can_do_io) {
         cpu_io_recompile(cpu, retaddr);
     }
@@ -503,6 +511,13 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
                 glue(helper_le_st_name, _do_mmio_access)(env, val, addr, oi,
                                                          mmu_idx, index,
                                                          retaddr);
+                /* N.B.: Here excl_succeeded == 1 means that this access comes
+                 * from an exclusive instruction. */
+                if (cpu->excl_succeeded) {
+                    MemoryRegion *mr = iotlb_to_region(cpu,
+                                            env->iotlb[mmu_idx][index].addr);
+                    mr->pending_excl_access = false;
+                }
             } else {
                 glue(helper_le_st_name, _do_ram_access)(env, val, addr, oi,
                                                         mmu_idx, index,retaddr);
@@ -652,6 +667,13 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
                 glue(helper_be_st_name, _do_mmio_access)(env, val, addr, oi,
                                                          mmu_idx, index,
                                                          retaddr);
+                /* N.B.: Here excl_succeeded == 1 means that this access comes
+                 * from an exclusive instruction. */
+                if (cpu->excl_succeeded) {
+                    MemoryRegion *mr = iotlb_to_region(cpu,
+                                            env->iotlb[mmu_idx][index].addr);
+                    mr->pending_excl_access = false;
+                }
             } else {
                 glue(helper_be_st_name, _do_ram_access)(env, val, addr, oi,
                                                         mmu_idx, index,retaddr);
-- 
2.6.4


* Re: [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation
  2015-12-14  8:41 [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation Alvise Rigo
                   ` (13 preceding siblings ...)
  2015-12-14  8:41 ` [Qemu-devel] [RFC v6 14/14] softmmu: Protect MMIO exclusive range Alvise Rigo
@ 2015-12-14  9:33 ` Paolo Bonzini
  2015-12-14 10:04   ` alvise rigo
  2015-12-14 22:09 ` Andreas Tobler
                   ` (2 subsequent siblings)
  17 siblings, 1 reply; 60+ messages in thread
From: Paolo Bonzini @ 2015-12-14  9:33 UTC (permalink / raw)
  To: Alvise Rigo, qemu-devel, mttcg
  Cc: claudio.fontana, Emilio G. Cota, jani.kokkonen, tech, alex.bennee, rth



On 14/12/2015 09:41, Alvise Rigo wrote:
> In theory, the provided implementation of TCG LoadLink/StoreConditional
> can be used to properly handle atomic instructions on any architecture.

No, _in theory_ this implementation is wrong.  If a normal store can
make a concurrent LL-SC pair fail, it's provably _impossible_ to handle
LL/SC with a wait-free fast path for normal stores.

If we decide that it's "good enough", because the race is incredibly
rare and doesn't happen anyway for spinlocks, then fine.  But it should
be represented correctly in the commit messages.

Paolo


* Re: [Qemu-devel] [RFC v6 09/14] softmmu: Add history of excl accesses
  2015-12-14  8:41 ` [Qemu-devel] [RFC v6 09/14] softmmu: Add history of excl accesses Alvise Rigo
@ 2015-12-14  9:35   ` Paolo Bonzini
  2015-12-15 14:26     ` alvise rigo
  0 siblings, 1 reply; 60+ messages in thread
From: Paolo Bonzini @ 2015-12-14  9:35 UTC (permalink / raw)
  To: Alvise Rigo, qemu-devel, mttcg
  Cc: alex.bennee, jani.kokkonen, tech, claudio.fontana, rth



On 14/12/2015 09:41, Alvise Rigo wrote:
> +static inline void excl_history_put_addr(CPUState *cpu, hwaddr addr)
> +{
> +    /* Avoid some overhead if the address we are about to put is equal to
> +     * the last one */
> +    if (cpu->excl_protected_addr[cpu->excl_protected_last] !=
> +                                    (addr & TARGET_PAGE_MASK)) {
> +        cpu->excl_protected_last = (cpu->excl_protected_last + 1) %
> +                                            EXCLUSIVE_HISTORY_LEN;

Either use "&" here...

> +        /* Unset EXCL bit of the oldest entry */
> +        if (cpu->excl_protected_addr[cpu->excl_protected_last] !=
> +                                            EXCLUSIVE_RESET_ADDR) {
> +            cpu_physical_memory_unset_excl(
> +                cpu->excl_protected_addr[cpu->excl_protected_last],
> +                cpu->cpu_index);
> +        }
> +
> +        /* Add a new address, overwriting the oldest one */
> +        cpu->excl_protected_addr[cpu->excl_protected_last] =
> +                                            addr & TARGET_PAGE_MASK;
> +    }
> +}
> +
>  #define MMUSUFFIX _mmu
>  
>  /* Generates LoadLink/StoreConditional helpers in softmmu_template.h */
> diff --git a/include/qom/cpu.h b/include/qom/cpu.h
> index 9e409ce..5f65ebf 100644
> --- a/include/qom/cpu.h
> +++ b/include/qom/cpu.h
> @@ -217,6 +217,7 @@ struct kvm_run;
>  
>  /* Atomic insn translation TLB support. */
>  #define EXCLUSIVE_RESET_ADDR ULLONG_MAX
> +#define EXCLUSIVE_HISTORY_LEN 8
>  
>  /**
>   * CPUState:
> @@ -343,6 +344,8 @@ struct CPUState {
>       * The address is set to EXCLUSIVE_RESET_ADDR if the vCPU is not.
>       * in the middle of a LL/SC. */
>      struct Range excl_protected_range;
> +    hwaddr excl_protected_addr[EXCLUSIVE_HISTORY_LEN];
> +    int excl_protected_last;

... or make this an "unsigned int".  Otherwise the code will contain an
actual (and slow) modulo operation.

Paolo

>      /* Used to carry the SC result but also to flag a normal (legacy)
>       * store access made by a stcond (see softmmu_template.h). */
>      int excl_succeeded;


* Re: [Qemu-devel] [RFC v6 06/14] configure: Use slow-path for atomic only when the softmmu is enabled
  2015-12-14  8:41 ` [Qemu-devel] [RFC v6 06/14] configure: Use slow-path for atomic only when the softmmu is enabled Alvise Rigo
@ 2015-12-14  9:38   ` Paolo Bonzini
  2015-12-14  9:39     ` Paolo Bonzini
  2015-12-14 10:14   ` Laurent Vivier
  1 sibling, 1 reply; 60+ messages in thread
From: Paolo Bonzini @ 2015-12-14  9:38 UTC (permalink / raw)
  To: Alvise Rigo, qemu-devel, mttcg
  Cc: alex.bennee, jani.kokkonen, tech, claudio.fontana, rth



On 14/12/2015 09:41, Alvise Rigo wrote:
> Use the new slow path for atomic instruction translation when the
> softmmu is enabled.
> 
> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
> ---
>  configure | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/configure b/configure
> index b9552fd..cc3891a 100755
> --- a/configure
> +++ b/configure
> @@ -4794,6 +4794,7 @@ echo "Install blobs     $blobs"
>  echo "KVM support       $kvm"
>  echo "RDMA support      $rdma"
>  echo "TCG interpreter   $tcg_interpreter"
> +echo "use ld/st excl    $softmmu"
>  echo "fdt support       $fdt"
>  echo "preadv support    $preadv"
>  echo "fdatasync         $fdatasync"
> @@ -5186,6 +5187,9 @@ fi
>  if test "$tcg_interpreter" = "yes" ; then
>    echo "CONFIG_TCG_INTERPRETER=y" >> $config_host_mak
>  fi
> +if test "$softmmu" = "yes" ; then
> +  echo "CONFIG_TCG_USE_LDST_EXCL=y" >> $config_host_mak
> +fi

Just use CONFIG_SOFTMMU in translate.c, no?

A target other than ARM might need ll/sc in user-mode emulation as well.

Paolo

>  if test "$fdatasync" = "yes" ; then
>    echo "CONFIG_FDATASYNC=y" >> $config_host_mak
>  fi
> 


* Re: [Qemu-devel] [RFC v6 06/14] configure: Use slow-path for atomic only when the softmmu is enabled
  2015-12-14  9:38   ` Paolo Bonzini
@ 2015-12-14  9:39     ` Paolo Bonzini
  0 siblings, 0 replies; 60+ messages in thread
From: Paolo Bonzini @ 2015-12-14  9:39 UTC (permalink / raw)
  To: Alvise Rigo, qemu-devel, mttcg
  Cc: alex.bennee, jani.kokkonen, tech, claudio.fontana, rth



On 14/12/2015 10:38, Paolo Bonzini wrote:
>> > +if test "$softmmu" = "yes" ; then
>> > +  echo "CONFIG_TCG_USE_LDST_EXCL=y" >> $config_host_mak
>> > +fi
> Just use CONFIG_SOFTMMU in translate.c, no?
> 
> A target other than ARM might need ll/sc in user-mode emulation as well.

Sorry, that makes no sense.  Couldn't hit cancel in time. :)

Paolo


* Re: [Qemu-devel] [RFC v6 05/14] tcg: Create new runtime helpers for excl accesses
  2015-12-14  8:41 ` [Qemu-devel] [RFC v6 05/14] tcg: Create new runtime helpers for excl accesses Alvise Rigo
@ 2015-12-14  9:40   ` Paolo Bonzini
  0 siblings, 0 replies; 60+ messages in thread
From: Paolo Bonzini @ 2015-12-14  9:40 UTC (permalink / raw)
  To: Alvise Rigo, qemu-devel, mttcg
  Cc: alex.bennee, jani.kokkonen, tech, claudio.fontana, rth



On 14/12/2015 09:41, Alvise Rigo wrote:
> diff --git a/tcg/tcg-llsc-gen-helper.h b/tcg/tcg-llsc-gen-helper.h
> new file mode 100644
> index 0000000..2b647cd
> --- /dev/null
> +++ b/tcg/tcg-llsc-gen-helper.h
> @@ -0,0 +1,32 @@
> +DEF_HELPER_3(ldlink_aa32_i8, i32, env, i32, i32)
> +DEF_HELPER_3(ldlink_aa32_i16_be, i32, env, i32, i32)
> +DEF_HELPER_3(ldlink_aa32_i32_be, i32, env, i32, i32)
> +DEF_HELPER_3(ldlink_aa32_i64_be, i64, env, i32, i32)
> +DEF_HELPER_3(ldlink_aa32_i16_le, i32, env, i32, i32)
> +DEF_HELPER_3(ldlink_aa32_i32_le, i32, env, i32, i32)
> +DEF_HELPER_3(ldlink_aa32_i64_le, i64, env, i32, i32)

"aa32" probably should be removed?

Paolo

> +DEF_HELPER_4(stcond_aa32_i8, i32, env, i32, i32, i32)


* Re: [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation
  2015-12-14  9:33 ` [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation Paolo Bonzini
@ 2015-12-14 10:04   ` alvise rigo
  2015-12-14 10:17     ` Paolo Bonzini
  0 siblings, 1 reply; 60+ messages in thread
From: alvise rigo @ 2015-12-14 10:04 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: mttcg, Claudio Fontana, QEMU Developers, Emilio G. Cota,
	Jani Kokkonen, VirtualOpenSystems Technical Team,
	Alex Bennée, Richard Henderson

Hi Paolo,


Thank you for your feedback.

On Mon, Dec 14, 2015 at 10:33 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
>
> On 14/12/2015 09:41, Alvise Rigo wrote:
> > In theory, the provided implementation of TCG LoadLink/StoreConditional
> > can be used to properly handle atomic instructions on any architecture.
>
> No, _in theory_ this implementation is wrong.  If a normal store can
> make a concurrent LL-SC pair fail, it's provably _impossible_ to handle
> LL/SC with a wait-free fast path for normal stores.
>
> If we decide that it's "good enough", because the race is incredibly
> rare and doesn't happen anyway for spinlocks, then fine.  But it should
> be represented correctly in the commit messages.


I have not yet commented extensively on this issue since this is still
the "single-threaded" version of the patch series.
As soon as the next version of mttcg is released, I will rebase this
series on top of the multi-threaded code.

In any case, what I proposed in the mttcg-based v5 was:
- An LL ensures that the TLB_EXCL flag is set in all the CPUs' TLBs.
This is done by requesting a TLB flush from all (not exactly all...) the
CPUs. To be 100% safe, we should probably also wait until the flush is
actually performed.
- A set TLB_EXCL flag always forces the slow path, allowing the CPUs
to check for a possible collision with an "exclusive memory region".

Now, why should requesting the flush (and possibly ensuring that it
has actually been done) not be enough?

Thank you,
alvise

>
>
> Paolo


* Re: [Qemu-devel] [RFC v6 06/14] configure: Use slow-path for atomic only when the softmmu is enabled
  2015-12-14  8:41 ` [Qemu-devel] [RFC v6 06/14] configure: Use slow-path for atomic only when the softmmu is enabled Alvise Rigo
  2015-12-14  9:38   ` Paolo Bonzini
@ 2015-12-14 10:14   ` Laurent Vivier
  2015-12-15 14:23     ` alvise rigo
  1 sibling, 1 reply; 60+ messages in thread
From: Laurent Vivier @ 2015-12-14 10:14 UTC (permalink / raw)
  To: Alvise Rigo, qemu-devel, mttcg
  Cc: claudio.fontana, pbonzini, jani.kokkonen, tech, alex.bennee, rth



On 14/12/2015 09:41, Alvise Rigo wrote:
> Use the new slow path for atomic instruction translation when the
> softmmu is enabled.
> 
> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
> ---
>  configure | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/configure b/configure
> index b9552fd..cc3891a 100755
> --- a/configure
> +++ b/configure
> @@ -4794,6 +4794,7 @@ echo "Install blobs     $blobs"
>  echo "KVM support       $kvm"
>  echo "RDMA support      $rdma"
>  echo "TCG interpreter   $tcg_interpreter"
> +echo "use ld/st excl    $softmmu"
>  echo "fdt support       $fdt"
>  echo "preadv support    $preadv"
>  echo "fdatasync         $fdatasync"
> @@ -5186,6 +5187,9 @@ fi
>  if test "$tcg_interpreter" = "yes" ; then
>    echo "CONFIG_TCG_INTERPRETER=y" >> $config_host_mak
>  fi
> +if test "$softmmu" = "yes" ; then
> +  echo "CONFIG_TCG_USE_LDST_EXCL=y" >> $config_host_mak
> +fi

why is this "$softmmu" and not "$target_softmmu" ?

>  if test "$fdatasync" = "yes" ; then
>    echo "CONFIG_FDATASYNC=y" >> $config_host_mak
>  fi
> 


* Re: [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation
  2015-12-14 10:04   ` alvise rigo
@ 2015-12-14 10:17     ` Paolo Bonzini
  2015-12-15 13:59       ` alvise rigo
  0 siblings, 1 reply; 60+ messages in thread
From: Paolo Bonzini @ 2015-12-14 10:17 UTC (permalink / raw)
  To: alvise rigo
  Cc: mttcg, Claudio Fontana, QEMU Developers, Emilio G. Cota,
	Jani Kokkonen, VirtualOpenSystems Technical Team,
	Alex Bennée, Richard Henderson



On 14/12/2015 11:04, alvise rigo wrote:
> In any case, what I proposed in the mttcg based v5 was:
> - A LL ensures that the TLB_EXCL flag is set on all the CPU's TLB.
> This is done by querying a TLB flush to all (not exactly all...) the
> CPUs. To be 100% safe, probably we should also wait that the flush is
> actually performed
> - A TLB_EXCL flag set always forces the slow-path, allowing the CPUs
> to check for possible collision with a "exclusive memory region"
> 
> Now, why the fact of querying the flush (and possibly ensuring that
> the flush has been actually done) should not be enough?

There will always be a race where the normal store fails.  While I
haven't studied your code enough to do a constructive proof, it's enough
to prove the impossibility of what you're trying to do.  Mind, I also
believed for a long time that it was possible to do it!

If we have two CPUs, with CPU 0 executing LL and the CPU 1 executing a
store, you can model this as a consensus problem.  For example, CPU 0
could propose that the subsequent SC succeeds, while CPU 1 proposes that
it fails.  The outcome of the SC instruction depends on who wins.

Therefore, implementing LL/SC problem requires---on both CPU 0 and CPU
1, and hence for both LL/SC and normal store---an atomic primitive with
consensus number >= 2.  Other than LL/SC itself, the commonly-available
operations satisfying this requirement are test-and-set (consensus
number 2) and compare-and-swap (infinite consensus number).  Normal
memory reads and writes (called "atomic registers" in multi-processing
research lingo) have consensus number 1; it's not enough.

If the host had LL/SC, CPU 1 could in principle delegate its side of the
consensus problem to the processor; but even that is not a solution
because processors constrain the instructions that can appear between
the load and the store, and this could cause an infinite sequence of
spurious failed SCs.  Another option is transactional memory, but it's
also too slow for normal stores.

The simplest solution is not to implement full LL/SC semantics; instead,
similar to linux-user, a SC operation can perform a cmpxchg from the
value fetched by LL to the argument of SC.  This bypasses the issue
because stores do not have to be instrumented at all, but it does mean
that the emulation suffers from the ABA problem.

TLB_EXCL is also a middle-ground, a little bit stronger than cmpxchg.
It's more complex and more accurate, but also not perfect.  Which is
okay, but has to be documented.

Paolo


* Re: [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation
  2015-12-14  8:41 [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation Alvise Rigo
                   ` (14 preceding siblings ...)
  2015-12-14  9:33 ` [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation Paolo Bonzini
@ 2015-12-14 22:09 ` Andreas Tobler
  2015-12-15  8:16   ` alvise rigo
  2015-12-17 16:06 ` Alex Bennée
  2016-01-06 18:00 ` Andrew Baumann
  17 siblings, 1 reply; 60+ messages in thread
From: Andreas Tobler @ 2015-12-14 22:09 UTC (permalink / raw)
  To: Alvise Rigo, qemu-devel, mttcg
  Cc: claudio.fontana, pbonzini, jani.kokkonen, tech, alex.bennee, rth

Alvise,

On 14.12.15 09:41, Alvise Rigo wrote:
> This is the sixth iteration of the patch series which applies to the
> upstream branch of QEMU (v2.5.0-rc3).
>
> Changes versus previous versions are at the bottom of this cover letter.
>
> The code is also available at following repository:
> https://git.virtualopensystems.com/dev/qemu-mt.git
> branch:
> slowpath-for-atomic-v6-no-mttcg

Thank you very much for this work. I tried to rebase it myself, but it
was over my head.

I'm looking for a QEMU solution where I can use all my cores.

My use case is porting GCC to aarch64-*-freebsd*. I think the OS
doesn't matter. This architecture does not yet have enough affordable
real hardware on the market, so I was looking into your solution;
Claudio gave me a hint about it.

Your recent merge/rebase only covers arm itself, not aarch64, right?

Linking fails with unreferenced cpu_exclusive_addr stuff in 
target-arm/translate-a64.c

Are you working on this already? Or Claudio?

> This work has been sponsored by Huawei Technologies Duesseldorf GmbH.

...

Thank you!
Andreas


* Re: [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation
  2015-12-14 22:09 ` Andreas Tobler
@ 2015-12-15  8:16   ` alvise rigo
  0 siblings, 0 replies; 60+ messages in thread
From: alvise rigo @ 2015-12-15  8:16 UTC (permalink / raw)
  To: Andreas Tobler
  Cc: mttcg, Claudio Fontana, QEMU Developers, Paolo Bonzini,
	Jani Kokkonen, VirtualOpenSystems Technical Team,
	Alex Bennée, Richard Henderson

Hi Andreas,

On Mon, Dec 14, 2015 at 11:09 PM, Andreas Tobler <andreast@fgznet.ch> wrote:
> Alvise,
>
> On 14.12.15 09:41, Alvise Rigo wrote:
>>
>> This is the sixth iteration of the patch series which applies to the
>> upstream branch of QEMU (v2.5.0-rc3).
>>
>> Changes versus previous versions are at the bottom of this cover letter.
>>
>> The code is also available at following repository:
>> https://git.virtualopensystems.com/dev/qemu-mt.git
>> branch:
>> slowpath-for-atomic-v6-no-mttcg
>
>
> Thank you very much for this work. I tried to rebase myself, but it was over
> my head.
>
> I'm looking for a qemu solution where I can use my cores.
>
> My use case is doing gcc porting for aarch64-*-freebsd*. I think it doesn't
> matter which OS. This arch has not enough real affordable HW solutions on
> the market yet. So I was looking for your solution. Claudio gave me a hint
> about it.
>
> Your recent merge/rebase only covers arm itself, not aarch64, right?

Indeed, only arm. Keep in mind that this patch series applies to the
upstream version of QEMU, not to the mttcg branch.
In other words, the repo contains a single-threaded version of QEMU
with some changes to atomic instruction handling in preparation for
multi-threaded emulation.

>
> Linking fails with unreferenced cpu_exclusive_addr stuff in
> target-arm/translate-a64.c

Even if aarch64 is not supported, this error should not happen. My
fault, I will fix it in the coming version.

>
> Are you working on this already? Or Claudio?

As soon as the mttcg branch is updated, I will rebase this patch
series on top of the new branch, and will possibly also cover the
aarch64 architecture.

Thank you,
alvise

>
>> This work has been sponsored by Huawei Technologies Duesseldorf GmbH.
>
>
> ...
>
> Thank you!
> Andreas
>


* Re: [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation
  2015-12-14 10:17     ` Paolo Bonzini
@ 2015-12-15 13:59       ` alvise rigo
  2015-12-15 14:18         ` Paolo Bonzini
  0 siblings, 1 reply; 60+ messages in thread
From: alvise rigo @ 2015-12-15 13:59 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: mttcg, Claudio Fontana, QEMU Developers, Emilio G. Cota,
	Jani Kokkonen, VirtualOpenSystems Technical Team,
	Alex Bennée, Richard Henderson

Hi Paolo,

On Mon, Dec 14, 2015 at 11:17 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 14/12/2015 11:04, alvise rigo wrote:
>> In any case, what I proposed in the mttcg based v5 was:
>> - A LL ensures that the TLB_EXCL flag is set on all the CPU's TLB.
>> This is done by querying a TLB flush to all (not exactly all...) the
>> CPUs. To be 100% safe, probably we should also wait that the flush is
>> actually performed
>> - A TLB_EXCL flag set always forces the slow-path, allowing the CPUs
>> to check for possible collision with a "exclusive memory region"
>>
>> Now, why the fact of querying the flush (and possibly ensuring that
>> the flush has been actually done) should not be enough?
>
> There will always be a race where the normal store fails.  While I
> haven't studied your code enough to do a constructive proof, it's enough
> to prove the impossibility of what you're trying to do.  Mind, I also
> believed for a long time that it was possible to do it!
>
> If we have two CPUs, with CPU 0 executing LL and the CPU 1 executing a
> store, you can model this as a consensus problem.  For example, CPU 0
> could propose that the subsequent SC succeeds, while CPU 1 proposes that
> it fails.  The outcome of the SC instruction depends on who wins.

I see your point. This, as you wrote, holds only when we attempt to
make the fast path wait-free.
However, the implementation I proposed is not wait-free: it serializes
the accesses made to the shared resources (which determine whether the
access was successful or not) by means of a mutex.
The assumption I made - and to some extent verified - is that such
"colliding fast accesses" are rare. I guess you agree on this as well;
otherwise, how could a wait-free implementation possibly work without
being coupled with primitives of an appropriate consensus number?

Thank you,
alvise

>
> Therefore, implementing LL/SC problem requires---on both CPU 0 and CPU
> 1, and hence for both LL/SC and normal store---an atomic primitive with
> consensus number >= 2.  Other than LL/SC itself, the commonly-available
> operations satisfying this requirement are test-and-set (consensus
> number 2) and compare-and-swap (infinite consensus number).  Normal
> memory reads and writes (called "atomic registers" in multi-processing
> research lingo) have consensus number 1; it's not enough.
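[As an illustration of the consensus hierarchy invoked above (Herlihy's), two-thread consensus can in fact be built from test-and-set; this is a textbook sketch with made-up names (`decide`, `proposal`), not QEMU code:]

```c
#include <stdatomic.h>

/* Two-process consensus from test-and-set (consensus number 2): each
 * thread publishes its proposal, then races on an atomic flag.  The
 * thread that flips the flag first wins, and both return its value. */
static int proposal[2];
static atomic_flag raced = ATOMIC_FLAG_INIT;

static int decide(int self, int value)
{
    proposal[self] = value;
    if (!atomic_flag_test_and_set(&raced)) {
        return value;               /* flag was clear: this thread won */
    }
    return proposal[1 - self];      /* flag already set: adopt winner's */
}
```

[Plain loads and stores alone cannot implement `decide` for two threads, which is exactly why instrumenting normal stores with reads/writes only cannot work.]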
>
> If the host had LL/SC, CPU 1 could in principle delegate its side of the
> consensus problem to the processor; but even that is not a solution
> because processors constrain the instructions that can appear between
> the load and the store, and this could cause an infinite sequence of
> spurious failed SCs.  Another option is transactional memory, but it's
> also too slow for normal stores.
>
> The simplest solution is not to implement full LL/SC semantics; instead,
> similar to linux-user, an SC operation can perform a cmpxchg from the
> value fetched by LL to the argument of SC.  This bypasses the issue
> because stores do not have to be instrumented at all, but it does mean
> that the emulation suffers from the ABA problem.
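[The cmpxchg-based emulation sketched above can be modelled roughly as follows - a simplified, hypothetical model, not the actual QEMU helpers (`VCPU`, `emulated_ll`, `emulated_sc` are made-up names). Note how a store sequence A -> B -> A between LL and SC goes undetected:]

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical per-vCPU state: the value fetched by the last LL. */
typedef struct { uint32_t linked_val; } VCPU;

static uint32_t emulated_ll(VCPU *cpu, _Atomic uint32_t *mem)
{
    cpu->linked_val = atomic_load(mem);
    return cpu->linked_val;
}

/* SC succeeds (returns 0) iff the location still holds the value seen
 * by LL.  Plain stores need no instrumentation, but if the value went
 * A -> B -> A in between, the SC wrongly succeeds: the ABA problem. */
static int emulated_sc(VCPU *cpu, _Atomic uint32_t *mem, uint32_t newval)
{
    uint32_t expected = cpu->linked_val;
    return atomic_compare_exchange_strong(mem, &expected, newval) ? 0 : 1;
}
```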
>
> TLB_EXCL is also a middle-ground, a little bit stronger than cmpxchg.
> It's more complex and more accurate, but also not perfect.  Which is
> okay, but has to be documented.
>
> Paolo

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation
  2015-12-15 13:59       ` alvise rigo
@ 2015-12-15 14:18         ` Paolo Bonzini
  2015-12-15 14:22           ` alvise rigo
  0 siblings, 1 reply; 60+ messages in thread
From: Paolo Bonzini @ 2015-12-15 14:18 UTC (permalink / raw)
  To: alvise rigo
  Cc: mttcg, Claudio Fontana, QEMU Developers, Emilio G. Cota,
	Jani Kokkonen, VirtualOpenSystems Technical Team,
	Alex Bennée, Richard Henderson



On 15/12/2015 14:59, alvise rigo wrote:
>> > If we have two CPUs, with CPU 0 executing LL and the CPU 1 executing a
>> > store, you can model this as a consensus problem.  For example, CPU 0
>> > could propose that the subsequent SC succeeds, while CPU 1 proposes that
>> > it fails.  The outcome of the SC instruction depends on who wins.
> I see your point. This, as you wrote, holds only when we attempt to
> make the fast path wait-free.
> However, the implementation I proposed is not wait-free and somehow
> serializes the accesses made to the shared resources (that will
> determine if the access was successful or not) by means of a mutex.
> The assumption I made - and somehow verified - is that the "colliding
> fast accesses" are rare.

Isn't the fast path (where TLB_EXCL is not set) wait-free?

This is enough to mess up the theory, though in practice it works.

> I guess you also agree on this, otherwise how
> could a wait-free implementation possibly work without being coupled
> with primitives with appropriate consensus number?

It couldn't. :)

Paolo


* Re: [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation
  2015-12-15 14:18         ` Paolo Bonzini
@ 2015-12-15 14:22           ` alvise rigo
  0 siblings, 0 replies; 60+ messages in thread
From: alvise rigo @ 2015-12-15 14:22 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: mttcg, Claudio Fontana, QEMU Developers, Emilio G. Cota,
	Jani Kokkonen, VirtualOpenSystems Technical Team,
	Alex Bennée, Richard Henderson

On Tue, Dec 15, 2015 at 3:18 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 15/12/2015 14:59, alvise rigo wrote:
>>> > If we have two CPUs, with CPU 0 executing LL and the CPU 1 executing a
>>> > store, you can model this as a consensus problem.  For example, CPU 0
>>> > could propose that the subsequent SC succeeds, while CPU 1 proposes that
>>> > it fails.  The outcome of the SC instruction depends on who wins.
>> I see your point. This, as you wrote, holds only when we attempt to
>> make the fast path wait-free.
>> However, the implementation I proposed is not wait-free and somehow
>> serializes the accesses made to the shared resources (that will
>> determine if the access was successful or not) by means of a mutex.
>> The assumption I made - and somehow verified - is that the "colliding
>> fast accesses" are rare.
>
> Isn't the fast path (where TLB_EXCL is not set) wait-free?

There is no such fast path if we force every CPU to exit the TB and
flush the TLB.
I thought that by "fast path" you were referring to a slow path
forced through TLB_EXCL, sorry.

alvise

>
> This is enough to mess up the theory, though in practice it works.
>
>> I guess you also agree on this, otherwise how
>> could a wait-free implementation possibly work without being coupled
>> with primitives with appropriate consensus number?
>
> It couldn't. :)
>
> Paolo


* Re: [Qemu-devel] [RFC v6 06/14] configure: Use slow-path for atomic only when the softmmu is enabled
  2015-12-14 10:14   ` Laurent Vivier
@ 2015-12-15 14:23     ` alvise rigo
  2015-12-15 14:31       ` Paolo Bonzini
  0 siblings, 1 reply; 60+ messages in thread
From: alvise rigo @ 2015-12-15 14:23 UTC (permalink / raw)
  To: Laurent Vivier
  Cc: mttcg, Claudio Fontana, QEMU Developers, Paolo Bonzini,
	Jani Kokkonen, VirtualOpenSystems Technical Team,
	Alex Bennée, Richard Henderson

Hi,

On Mon, Dec 14, 2015 at 11:14 AM, Laurent Vivier <lvivier@redhat.com> wrote:
>
>
> On 14/12/2015 09:41, Alvise Rigo wrote:
>> Use the new slow path for atomic instruction translation when the
>> softmmu is enabled.
>>
>> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
>> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
>> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
>> ---
>>  configure | 4 ++++
>>  1 file changed, 4 insertions(+)
>>
>> diff --git a/configure b/configure
>> index b9552fd..cc3891a 100755
>> --- a/configure
>> +++ b/configure
>> @@ -4794,6 +4794,7 @@ echo "Install blobs     $blobs"
>>  echo "KVM support       $kvm"
>>  echo "RDMA support      $rdma"
>>  echo "TCG interpreter   $tcg_interpreter"
>> +echo "use ld/st excl    $softmmu"
>>  echo "fdt support       $fdt"
>>  echo "preadv support    $preadv"
>>  echo "fdatasync         $fdatasync"
>> @@ -5186,6 +5187,9 @@ fi
>>  if test "$tcg_interpreter" = "yes" ; then
>>    echo "CONFIG_TCG_INTERPRETER=y" >> $config_host_mak
>>  fi
>> +if test "$softmmu" = "yes" ; then
>> +  echo "CONFIG_TCG_USE_LDST_EXCL=y" >> $config_host_mak
>> +fi
>
> why is this "$softmmu" and not "$target_softmmu" ?

I see now that it is $target_softmmu that sets CONFIG_SOFTMMU=y.
So, for my understanding: in which cases is $softmmu set
while $target_softmmu is not?

Thank you,
alvise

>
>>  if test "$fdatasync" = "yes" ; then
>>    echo "CONFIG_FDATASYNC=y" >> $config_host_mak
>>  fi
>>


* Re: [Qemu-devel] [RFC v6 09/14] softmmu: Add history of excl accesses
  2015-12-14  9:35   ` Paolo Bonzini
@ 2015-12-15 14:26     ` alvise rigo
  0 siblings, 0 replies; 60+ messages in thread
From: alvise rigo @ 2015-12-15 14:26 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: mttcg, Claudio Fontana, QEMU Developers, Jani Kokkonen,
	VirtualOpenSystems Technical Team, Alex Bennée,
	Richard Henderson

On Mon, Dec 14, 2015 at 10:35 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
>
>
> On 14/12/2015 09:41, Alvise Rigo wrote:
>> +static inline void excl_history_put_addr(CPUState *cpu, hwaddr addr)
>> +{
>> +    /* Avoid some overhead if the address we are about to put is equal to
>> +     * the last one */
>> +    if (cpu->excl_protected_addr[cpu->excl_protected_last] !=
>> +                                    (addr & TARGET_PAGE_MASK)) {
>> +        cpu->excl_protected_last = (cpu->excl_protected_last + 1) %
>> +                                            EXCLUSIVE_HISTORY_LEN;
>
> Either use "&" here...
>
>> +        /* Unset EXCL bit of the oldest entry */
>> +        if (cpu->excl_protected_addr[cpu->excl_protected_last] !=
>> +                                            EXCLUSIVE_RESET_ADDR) {
>> +            cpu_physical_memory_unset_excl(
>> +                cpu->excl_protected_addr[cpu->excl_protected_last],
>> +                cpu->cpu_index);
>> +        }
>> +
>> +        /* Add a new address, overwriting the oldest one */
>> +        cpu->excl_protected_addr[cpu->excl_protected_last] =
>> +                                            addr & TARGET_PAGE_MASK;
>> +    }
>> +}
>> +
>>  #define MMUSUFFIX _mmu
>>
>>  /* Generates LoadLink/StoreConditional helpers in softmmu_template.h */
>> diff --git a/include/qom/cpu.h b/include/qom/cpu.h
>> index 9e409ce..5f65ebf 100644
>> --- a/include/qom/cpu.h
>> +++ b/include/qom/cpu.h
>> @@ -217,6 +217,7 @@ struct kvm_run;
>>
>>  /* Atomic insn translation TLB support. */
>>  #define EXCLUSIVE_RESET_ADDR ULLONG_MAX
>> +#define EXCLUSIVE_HISTORY_LEN 8
>>
>>  /**
>>   * CPUState:
>> @@ -343,6 +344,8 @@ struct CPUState {
>>       * The address is set to EXCLUSIVE_RESET_ADDR if the vCPU is not.
>>       * in the middle of a LL/SC. */
>>      struct Range excl_protected_range;
>> +    hwaddr excl_protected_addr[EXCLUSIVE_HISTORY_LEN];
>> +    int excl_protected_last;
>
> ... or make this an "unsigned int".  Otherwise the code will contain an
> actual (and slow) modulo operation.

Absolutely true.

Thank you,
alvise

>
> Paolo
>
>>      /* Used to carry the SC result but also to flag a normal (legacy)
>>       * store access made by a stcond (see softmmu_template.h). */
>>      int excl_succeeded;
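[Paolo's suggestion above - mask with '&', or make the index unsigned - amounts to the following standalone sketch of the history ring (simplified, hypothetical code, not the patch itself):]

```c
#include <stdint.h>

#define EXCLUSIVE_HISTORY_LEN 8   /* must remain a power of two for '&' */

typedef struct {
    uint64_t excl_protected_addr[EXCLUSIVE_HISTORY_LEN];
    unsigned int excl_protected_last;  /* unsigned: '%' can become a mask */
} History;

static void history_put(History *h, uint64_t addr)
{
    /* '& (LEN - 1)' replaces the slow modulo operation */
    h->excl_protected_last = (h->excl_protected_last + 1) &
                             (EXCLUSIVE_HISTORY_LEN - 1);
    h->excl_protected_addr[h->excl_protected_last] = addr;
}
```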


* Re: [Qemu-devel] [RFC v6 06/14] configure: Use slow-path for atomic only when the softmmu is enabled
  2015-12-15 14:23     ` alvise rigo
@ 2015-12-15 14:31       ` Paolo Bonzini
  2015-12-15 15:18         ` Laurent Vivier
  0 siblings, 1 reply; 60+ messages in thread
From: Paolo Bonzini @ 2015-12-15 14:31 UTC (permalink / raw)
  To: alvise rigo, Laurent Vivier
  Cc: mttcg, Claudio Fontana, QEMU Developers, Jani Kokkonen,
	VirtualOpenSystems Technical Team, Alex Bennée,
	Richard Henderson



On 15/12/2015 15:23, alvise rigo wrote:
> Hi,
> 
> On Mon, Dec 14, 2015 at 11:14 AM, Laurent Vivier <lvivier@redhat.com> wrote:
>>
>>
>> On 14/12/2015 09:41, Alvise Rigo wrote:
>>> Use the new slow path for atomic instruction translation when the
>>> softmmu is enabled.
>>>
>>> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
>>> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
>>> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
>>> ---
>>>  configure | 4 ++++
>>>  1 file changed, 4 insertions(+)
>>>
>>> diff --git a/configure b/configure
>>> index b9552fd..cc3891a 100755
>>> --- a/configure
>>> +++ b/configure
>>> @@ -4794,6 +4794,7 @@ echo "Install blobs     $blobs"
>>>  echo "KVM support       $kvm"
>>>  echo "RDMA support      $rdma"
>>>  echo "TCG interpreter   $tcg_interpreter"
>>> +echo "use ld/st excl    $softmmu"
>>>  echo "fdt support       $fdt"
>>>  echo "preadv support    $preadv"
>>>  echo "fdatasync         $fdatasync"
>>> @@ -5186,6 +5187,9 @@ fi
>>>  if test "$tcg_interpreter" = "yes" ; then
>>>    echo "CONFIG_TCG_INTERPRETER=y" >> $config_host_mak
>>>  fi
>>> +if test "$softmmu" = "yes" ; then
>>> +  echo "CONFIG_TCG_USE_LDST_EXCL=y" >> $config_host_mak
>>> +fi
>>
>> why is this "$softmmu" and not "$target_softmmu" ?
> 
> I see now that it is $target_softmmu that sets CONFIG_SOFTMMU=y.
> So, for my understanding: in which cases is $softmmu set
> while $target_softmmu is not?

When compiling foo-linux-user.

Paolo


* Re: [Qemu-devel] [RFC v6 06/14] configure: Use slow-path for atomic only when the softmmu is enabled
  2015-12-15 14:31       ` Paolo Bonzini
@ 2015-12-15 15:18         ` Laurent Vivier
  0 siblings, 0 replies; 60+ messages in thread
From: Laurent Vivier @ 2015-12-15 15:18 UTC (permalink / raw)
  To: Paolo Bonzini, alvise rigo
  Cc: mttcg, Claudio Fontana, QEMU Developers, Jani Kokkonen,
	VirtualOpenSystems Technical Team, Alex Bennée,
	Richard Henderson



On 15/12/2015 15:31, Paolo Bonzini wrote:
> 
> 
> On 15/12/2015 15:23, alvise rigo wrote:
>> Hi,
>>
>> On Mon, Dec 14, 2015 at 11:14 AM, Laurent Vivier <lvivier@redhat.com> wrote:
>>>
>>>
>>> On 14/12/2015 09:41, Alvise Rigo wrote:
>>>> Use the new slow path for atomic instruction translation when the
>>>> softmmu is enabled.
>>>>
>>>> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
>>>> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
>>>> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
>>>> ---
>>>>  configure | 4 ++++
>>>>  1 file changed, 4 insertions(+)
>>>>
>>>> diff --git a/configure b/configure
>>>> index b9552fd..cc3891a 100755
>>>> --- a/configure
>>>> +++ b/configure
>>>> @@ -4794,6 +4794,7 @@ echo "Install blobs     $blobs"
>>>>  echo "KVM support       $kvm"
>>>>  echo "RDMA support      $rdma"
>>>>  echo "TCG interpreter   $tcg_interpreter"
>>>> +echo "use ld/st excl    $softmmu"
>>>>  echo "fdt support       $fdt"
>>>>  echo "preadv support    $preadv"
>>>>  echo "fdatasync         $fdatasync"
>>>> @@ -5186,6 +5187,9 @@ fi
>>>>  if test "$tcg_interpreter" = "yes" ; then
>>>>    echo "CONFIG_TCG_INTERPRETER=y" >> $config_host_mak
>>>>  fi
>>>> +if test "$softmmu" = "yes" ; then
>>>> +  echo "CONFIG_TCG_USE_LDST_EXCL=y" >> $config_host_mak
>>>> +fi
>>>
>>> why is this "$softmmu" and not "$target_softmmu" ?
>>
>> I see now that it is $target_softmmu that sets CONFIG_SOFTMMU=y.
>> So, for my understanding: in which cases is $softmmu set
>> while $target_softmmu is not?
> 
> When compiling foo-linux-user.

In fact, after having asked the question, I found that it is right to
use $softmmu, which is a host parameter (while $target_softmmu is a
guest parameter): $softmmu is set to true if there is at least one
*-softmmu target.

Laurent


* Re: [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation
  2015-12-14  8:41 [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation Alvise Rigo
                   ` (15 preceding siblings ...)
  2015-12-14 22:09 ` Andreas Tobler
@ 2015-12-17 16:06 ` Alex Bennée
  2015-12-17 16:16   ` alvise rigo
  2016-01-06 18:00 ` Andrew Baumann
  17 siblings, 1 reply; 60+ messages in thread
From: Alex Bennée @ 2015-12-17 16:06 UTC (permalink / raw)
  To: Alvise Rigo
  Cc: mttcg, claudio.fontana, qemu-devel, pbonzini, jani.kokkonen, tech, rth


Alvise Rigo <a.rigo@virtualopensystems.com> writes:

> This is the sixth iteration of the patch series which applies to the
> upstream branch of QEMU (v2.5.0-rc3).
>
> Changes versus previous versions are at the bottom of this cover letter.
>
> The code is also available at following repository:
> https://git.virtualopensystems.com/dev/qemu-mt.git
> branch:
> slowpath-for-atomic-v6-no-mttcg

I'm starting to look through this now. However, one problem that
immediately comes up is the aarch64 breakage. Because there is an
intrinsic link between a lot of the arm and aarch64 code, it breaks the
other targets.

You could fix this by ensuring that CONFIG_TCG_USE_LDST_EXCL doesn't get
passed to the aarch64 build (tricky as aarch64-softmmu.mak includes
arm-softmmu.mak) or bite the bullet now and add the 64 bit helpers that
will be needed to convert the aarch64 exclusive equivalents.

>
> This patch series provides an infrastructure for atomic instruction
> implementation in QEMU, thus offering a 'legacy' solution for
> translating guest atomic instructions. Moreover, it can be considered as
> a first step toward a multi-thread TCG.
>
> The underlying idea is to provide new TCG helpers (sort of softmmu
> helpers) that guarantee atomicity to some memory accesses or in general
> a way to define memory transactions.
>
> More specifically, the new softmmu helpers behave as LoadLink and
> StoreConditional instructions, and are called from TCG code by means of
> target specific helpers. This work includes the implementation for all
> the ARM atomic instructions, see target-arm/op_helper.c.
>
> The implementation heavily uses the software TLB together with a new
> bitmap that has been added to the ram_list structure which flags, on a
> per-CPU basis, all the memory pages that are in the middle of a LoadLink
> (LL), StoreConditional (SC) operation.  Since all these pages can be
> accessed directly through the fast-path and alter a vCPU's linked value,
> the new bitmap has been coupled with a new TLB flag for the TLB virtual
> address which forces the slow-path execution for all the accesses to a
> page containing a linked address.
>
> The new slow-path is implemented such that:
> - the LL behaves as a normal load slow-path, except for clearing the
>   dirty flag in the bitmap.  The cputlb.c code, while generating a TLB
>   entry, checks whether there is at least one vCPU that has the bit cleared
>   in the exclusive bitmap; in that case the TLB entry will have the EXCL
>   flag set, thus forcing the slow-path.  In order to ensure that all the
>   vCPUs will follow the slow-path for that page, we flush the TLB cache
>   of all the other vCPUs.
>
>   The LL will also set the linked address and size of the access in a
>   vCPU's private variable. After the corresponding SC, this address will
>   be set to a reset value.
>
> - the SC can fail returning 1, or succeed, returning 0.  It has to come
>   always after a LL and has to access the same address 'linked' by the
>   previous LL, otherwise it will fail. If in the time window delimited
>   by a legit pair of LL/SC operations another write access happens to
>   the linked address, the SC will fail.
>
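[Under the assumption of a single big lock around the slow path, the LL/SC protocol described in the two bullets above can be condensed into a toy model (made-up, simplified code, not the series' helpers, which also track the access size and the per-page exclusive bitmap):]

```c
#include <stdint.h>
#include <limits.h>

#define EXCLUSIVE_RESET_ADDR ULLONG_MAX  /* name used in the series */

typedef struct {
    unsigned long long excl_addr;  /* linked address, or reset value */
    uint32_t linked_val;
} VCPU;

/* Toy guest RAM: one word is enough for the model. */
static uint32_t ram_word;

static uint32_t helper_ll(VCPU *cpu, unsigned long long addr)
{
    cpu->excl_addr = addr;          /* open the exclusive window */
    cpu->linked_val = ram_word;
    return cpu->linked_val;
}

/* A write from another vCPU in the window would reset cpu->excl_addr;
 * here we model only the address check and the reset after the SC. */
static int helper_sc(VCPU *cpu, unsigned long long addr, uint32_t val)
{
    int fail = (cpu->excl_addr != addr);
    if (!fail) {
        ram_word = val;
    }
    cpu->excl_addr = EXCLUSIVE_RESET_ADDR;  /* window closed either way */
    return fail;                            /* 0 = success, 1 = failure */
}
```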
> In theory, the provided implementation of TCG LoadLink/StoreConditional
> can be used to properly handle atomic instructions on any architecture.
>
> The code has been tested with bare-metal test cases and by booting Linux.
>
> * Performance considerations
> The new slow-path adds some overhead to the translation of the ARM
> atomic instructions, since their emulation no longer happens only
> in the guest (by means of pure TCG-generated code), but requires the
> execution of two helper functions. Despite this, the additional time
> required to boot an ARM Linux kernel on an i7 clocked at 2.5GHz is
> negligible.
> Instead, in an LL/SC-bound test scenario - like:
> https://git.virtualopensystems.com/dev/tcg_baremetal_tests.git - this
> solution requires 30% (1 million iterations) and 70% (10 million
> iterations) more time for the test to complete.
>
> Changes from v5:
> - The exclusive memory region is now set through a CPUClass hook,
>   allowing any architecture to decide the memory area that will be
>   protected during a LL/SC operation [PATCH 3]
> - The runtime helpers dropped any target dependency and are now in a
>   common file [PATCH 5]
> - Improved the way we restore a guest page as non-exclusive [PATCH 9]
> - Included MMIO memory as a possible target of LL/SC
>   instructions. This also required simplifying the
>   helper_*_st_name helpers in softmmu_template.h [PATCH 8-14]
>
> Changes from v4:
> - Reworked the exclusive bitmap to be of fixed size (8 bits per address)
> - The slow-path is now TCG backend independent, no need to touch
>   tcg/* anymore as suggested by Aurelien Jarno.
>
> Changes from v3:
> - based on upstream QEMU
> - addressed comments from Alex Bennée
> - the slow path can be enabled by the user with:
>   ./configure --enable-tcg-ldst-excl only if the backend supports it
> - all the ARM ldex/stex instructions now make use of the slow path
> - added aarch64 TCG backend support
> - part of the code has been rewritten
>
> Changes from v2:
> - the bitmap accessors are now atomic
> - a rendezvous between vCPUs and a simple callback support before executing
>   a TB have been added to handle the TLB flush support
> - the softmmu_template and softmmu_llsc_template have been adapted to work
>   on real multi-threading
>
> Changes from v1:
> - The ram bitmap is not reversed anymore, 1 = dirty, 0 = exclusive
> - The way the offset used to access the bitmap is calculated has
>   been improved and fixed
> - A page to be set as dirty requires a vCPU to target the protected address
>   and not just an address in the page
> - Addressed comments from Richard Henderson to improve the logic in
>   softmmu_template.h and to simplify the methods generation through
>   softmmu_llsc_template.h
> - Added initial implementation of qemu_{ldlink,stcond}_i32 for tcg/i386
>
> This work has been sponsored by Huawei Technologies Duesseldorf GmbH.
>
> Alvise Rigo (14):
>   exec.c: Add new exclusive bitmap to ram_list
>   softmmu: Add new TLB_EXCL flag
>   Add CPUClass hook to set exclusive range
>   softmmu: Add helpers for a new slowpath
>   tcg: Create new runtime helpers for excl accesses
>   configure: Use slow-path for atomic only when the softmmu is enabled
>   target-arm: translate: Use ld/st excl for atomic insns
>   target-arm: Add atomic_clear helper for CLREX insn
>   softmmu: Add history of excl accesses
>   softmmu: Simplify helper_*_st_name, wrap unaligned code
>   softmmu: Simplify helper_*_st_name, wrap MMIO code
>   softmmu: Simplify helper_*_st_name, wrap RAM code
>   softmmu: Include MMIO/invalid exclusive accesses
>   softmmu: Protect MMIO exclusive range
>
>  Makefile.target             |   2 +-
>  configure                   |   4 +
>  cputlb.c                    |  67 ++++++++-
>  exec.c                      |   8 +-
>  include/exec/cpu-all.h      |   8 ++
>  include/exec/cpu-defs.h     |   1 +
>  include/exec/helper-gen.h   |   1 +
>  include/exec/helper-proto.h |   1 +
>  include/exec/helper-tcg.h   |   1 +
>  include/exec/memory.h       |   4 +-
>  include/exec/ram_addr.h     |  76 ++++++++++
>  include/qom/cpu.h           |  21 +++
>  qom/cpu.c                   |   7 +
>  softmmu_llsc_template.h     | 144 +++++++++++++++++++
>  softmmu_template.h          | 338 +++++++++++++++++++++++++++++++++-----------
>  target-arm/helper.h         |   2 +
>  target-arm/op_helper.c      |   6 +
>  target-arm/translate.c      | 102 ++++++++++++-
>  tcg-llsc-helper.c           | 109 ++++++++++++++
>  tcg-llsc-helper.h           |  35 +++++
>  tcg/tcg-llsc-gen-helper.h   |  32 +++++
>  tcg/tcg.h                   |  31 ++++
>  22 files changed, 909 insertions(+), 91 deletions(-)
>  create mode 100644 softmmu_llsc_template.h
>  create mode 100644 tcg-llsc-helper.c
>  create mode 100644 tcg-llsc-helper.h
>  create mode 100644 tcg/tcg-llsc-gen-helper.h


--
Alex Bennée


* Re: [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation
  2015-12-17 16:06 ` Alex Bennée
@ 2015-12-17 16:16   ` alvise rigo
  0 siblings, 0 replies; 60+ messages in thread
From: alvise rigo @ 2015-12-17 16:16 UTC (permalink / raw)
  To: Alex Bennée
  Cc: mttcg, Claudio Fontana, QEMU Developers, Paolo Bonzini,
	Jani Kokkonen, VirtualOpenSystems Technical Team,
	Richard Henderson


Hi Alex,

On Thu, Dec 17, 2015 at 5:06 PM, Alex Bennée <alex.bennee@linaro.org> wrote:

>
> Alvise Rigo <a.rigo@virtualopensystems.com> writes:
>
> > This is the sixth iteration of the patch series which applies to the
> > upstream branch of QEMU (v2.5.0-rc3).
> >
> > Changes versus previous versions are at the bottom of this cover letter.
> >
> > The code is also available at following repository:
> > https://git.virtualopensystems.com/dev/qemu-mt.git
> > branch:
> > slowpath-for-atomic-v6-no-mttcg
>
> I'm starting to look through this now. However one problem that
>

Thank you for this.


> immediately comes up is the aarch64 breakage. Because there is an
> intrinsic link between a lot of the arm and aarch64 code it breaks the
> other targets.
>
> You could fix this by ensuring that CONFIG_TCG_USE_LDST_EXCL doesn't get
> passed to the aarch64 build (tricky as aarch64-softmmu.mak includes
> arm-softmmu.mak) or bite the bullet now and add the 64 bit helpers that
> will be needed to convert the aarch64 exclusive equivalents.
>

This is what I'm doing right now :)

Best regards,
alvise





* Re: [Qemu-devel] [RFC v6 12/14] softmmu: Simplify helper_*_st_name, wrap RAM code
  2015-12-14  8:41 ` [Qemu-devel] [RFC v6 12/14] softmmu: Simplify helper_*_st_name, wrap RAM code Alvise Rigo
@ 2015-12-17 16:52   ` Alex Bennée
  2015-12-17 17:13     ` alvise rigo
  0 siblings, 1 reply; 60+ messages in thread
From: Alex Bennée @ 2015-12-17 16:52 UTC (permalink / raw)
  To: Alvise Rigo
  Cc: mttcg, claudio.fontana, qemu-devel, pbonzini, jani.kokkonen, tech, rth


Alvise Rigo <a.rigo@virtualopensystems.com> writes:

> Attempting to simplify the helper_*_st_name, wrap the code relative to a
> RAM access into an inline function.

This commit breaks a default x86_64-softmmu build:

  CC    x86_64-softmmu/../hw/audio/pcspk.o
In file included from /home/alex/lsrc/qemu/qemu.git/cputlb.c:527:0:
/home/alex/lsrc/qemu/qemu.git/softmmu_template.h: In function ‘helper_ret_stb_mmu’:
/home/alex/lsrc/qemu/qemu.git/softmmu_template.h:503:13: error: ‘haddr’ undeclared (first use in this function)
             haddr = addr + env->tlb_table[mmu_idx][index].addend;
             ^
/home/alex/lsrc/qemu/qemu.git/softmmu_template.h:503:13: note: each undeclared identifier is reported only once for each function it appears in
In file included from /home/alex/lsrc/qemu/qemu.git/cputlb.c:530:0:
/home/alex/lsrc/qemu/qemu.git/softmmu_template.h: In function ‘helper_le_stw_mmu’:
/home/alex/lsrc/qemu/qemu.git/softmmu_template.h:503:13: error: ‘haddr’ undeclared (first use in this function)
             haddr = addr + env->tlb_table[mmu_idx][index].addend;
             ^
/home/alex/lsrc/qemu/qemu.git/softmmu_template.h: In function ‘helper_be_stw_mmu’:
/home/alex/lsrc/qemu/qemu.git/softmmu_template.h:651:13: error: ‘haddr’ undeclared (first use in this function)
             haddr = addr + env->tlb_table[mmu_idx][index].addend;
             ^
In file included from /home/alex/lsrc/qemu/qemu.git/cputlb.c:533:0:
/home/alex/lsrc/qemu/qemu.git/softmmu_template.h: In function ‘helper_le_stl_mmu’:
/home/alex/lsrc/qemu/qemu.git/softmmu_template.h:503:13: error: ‘haddr’ undeclared (first use in this function)
             haddr = addr + env->tlb_table[mmu_idx][index].addend;
             ^
/home/alex/lsrc/qemu/qemu.git/softmmu_template.h: In function ‘helper_be_stl_mmu’:
/home/alex/lsrc/qemu/qemu.git/softmmu_template.h:651:13: error: ‘haddr’ undeclared (first use in this function)
             haddr = addr + env->tlb_table[mmu_idx][index].addend;
             ^
In file included from /home/alex/lsrc/qemu/qemu.git/cputlb.c:536:0:
/home/alex/lsrc/qemu/qemu.git/softmmu_template.h: In function ‘helper_le_stq_mmu’:
/home/alex/lsrc/qemu/qemu.git/softmmu_template.h:503:13: error: ‘haddr’ undeclared (first use in this function)
             haddr = addr + env->tlb_table[mmu_idx][index].addend;
             ^
/home/alex/lsrc/qemu/qemu.git/softmmu_template.h: In function ‘helper_be_stq_mmu’:
/home/alex/lsrc/qemu/qemu.git/softmmu_template.h:651:13: error: ‘haddr’ undeclared (first use in this function)
             haddr = addr + env->tlb_table[mmu_idx][index].addend;
             ^
make[1]: *** [cputlb.o] Error 1
make[1]: *** Waiting for unfinished jobs....
  CC    x86_64-softmmu/../hw/block/fdc.o
make: *** [subdir-x86_64-softmmu] Error 2

ERROR: commit 3a371deaf11ce944127a00eadbc7e811b6798de1 failed to build!
commit 3a371deaf11ce944127a00eadbc7e811b6798de1
Author: Alvise Rigo <a.rigo@virtualopensystems.com>
Date:   Thu Dec 10 17:26:54 2015 +0100

    softmmu: Simplify helper_*_st_name, wrap RAM code

Found while checking with Jeff's compile-check script:

https://github.com/codyprime/git-scripts

git compile-check -r c3626ca7df027dabf0568284360a23faf18f0884..HEAD

>
> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
> ---
>  softmmu_template.h | 110 +++++++++++++++++++++++++++++++++--------------------
>  1 file changed, 68 insertions(+), 42 deletions(-)
>
> diff --git a/softmmu_template.h b/softmmu_template.h
> index 2ebf527..262c95f 100644
> --- a/softmmu_template.h
> +++ b/softmmu_template.h
> @@ -416,13 +416,46 @@ static inline void glue(helper_le_st_name, _do_mmio_access)(CPUArchState *env,
>      glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
>  }
>
> +static inline void glue(helper_le_st_name, _do_ram_access)(CPUArchState *env,
> +                                                           DATA_TYPE val,
> +                                                           target_ulong addr,
> +                                                           TCGMemOpIdx oi,
> +                                                           unsigned mmu_idx,
> +                                                           int index,
> +                                                           uintptr_t retaddr)
> +{
> +    uintptr_t haddr;
> +
> +    /* Handle slow unaligned access (it spans two pages or IO).  */
> +    if (DATA_SIZE > 1
> +        && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
> +                     >= TARGET_PAGE_SIZE)) {
> +        glue(helper_le_st_name, _do_unl_access)(env, val, addr, oi, mmu_idx,
> +                                                retaddr);
> +        return;
> +    }
> +
> +    /* Handle aligned access or unaligned access in the same page.  */
> +    if ((addr & (DATA_SIZE - 1)) != 0
> +        && (get_memop(oi) & MO_AMASK) == MO_ALIGN) {
> +        cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
> +                             mmu_idx, retaddr);
> +    }
> +
> +    haddr = addr + env->tlb_table[mmu_idx][index].addend;
> +#if DATA_SIZE == 1
> +    glue(glue(st, SUFFIX), _p)((uint8_t *)haddr, val);
> +#else
> +    glue(glue(st, SUFFIX), _le_p)((uint8_t *)haddr, val);
> +#endif
> +}
> +
>  void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>                         TCGMemOpIdx oi, uintptr_t retaddr)
>  {
>      unsigned mmu_idx = get_mmuidx(oi);
>      int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
>      target_ulong tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
> -    uintptr_t haddr;
>
>      /* Adjust the given return address.  */
>      retaddr -= GETPC_ADJ;
> @@ -484,28 +517,8 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>          }
>      }
>
> -    /* Handle slow unaligned access (it spans two pages or IO).  */
> -    if (DATA_SIZE > 1
> -        && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
> -                     >= TARGET_PAGE_SIZE)) {
> -        glue(helper_le_st_name, _do_unl_access)(env, val, addr, oi, mmu_idx,
> -                                                retaddr);
> -        return;
> -    }
> -
> -    /* Handle aligned access or unaligned access in the same page.  */
> -    if ((addr & (DATA_SIZE - 1)) != 0
> -        && (get_memop(oi) & MO_AMASK) == MO_ALIGN) {
> -        cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
> -                             mmu_idx, retaddr);
> -    }
> -
> -    haddr = addr + env->tlb_table[mmu_idx][index].addend;
> -#if DATA_SIZE == 1
> -    glue(glue(st, SUFFIX), _p)((uint8_t *)haddr, val);
> -#else
> -    glue(glue(st, SUFFIX), _le_p)((uint8_t *)haddr, val);
> -#endif
> +    glue(helper_le_st_name, _do_ram_access)(env, val, addr, oi, mmu_idx, index,
> +                                            retaddr);
>  }
>
>  #if DATA_SIZE > 1
> @@ -555,13 +568,42 @@ static inline void glue(helper_be_st_name, _do_mmio_access)(CPUArchState *env,
>      glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
>  }
>
> +static inline void glue(helper_be_st_name, _do_ram_access)(CPUArchState *env,
> +                                                           DATA_TYPE val,
> +                                                           target_ulong addr,
> +                                                           TCGMemOpIdx oi,
> +                                                           unsigned mmu_idx,
> +                                                           int index,
> +                                                           uintptr_t retaddr)
> +{
> +    uintptr_t haddr;
> +
> +    /* Handle slow unaligned access (it spans two pages or IO).  */
> +    if (DATA_SIZE > 1
> +        && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
> +                     >= TARGET_PAGE_SIZE)) {
> +        glue(helper_be_st_name, _do_unl_access)(env, val, addr, oi, mmu_idx,
> +                                                retaddr);
> +        return;
> +    }
> +
> +    /* Handle aligned access or unaligned access in the same page.  */
> +    if ((addr & (DATA_SIZE - 1)) != 0
> +        && (get_memop(oi) & MO_AMASK) == MO_ALIGN) {
> +        cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
> +                             mmu_idx, retaddr);
> +    }
> +
> +    haddr = addr + env->tlb_table[mmu_idx][index].addend;
> +    glue(glue(st, SUFFIX), _be_p)((uint8_t *)haddr, val);
> +}
> +
>  void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>                         TCGMemOpIdx oi, uintptr_t retaddr)
>  {
>      unsigned mmu_idx = get_mmuidx(oi);
>      int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
>      target_ulong tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
> -    uintptr_t haddr;
>
>      /* Adjust the given return address.  */
>      retaddr -= GETPC_ADJ;
> @@ -623,24 +665,8 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>          }
>      }
>
> -    /* Handle slow unaligned access (it spans two pages or IO).  */
> -    if (DATA_SIZE > 1
> -        && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
> -                     >= TARGET_PAGE_SIZE)) {
> -        glue(helper_be_st_name, _do_unl_access)(env, val, addr, oi, mmu_idx,
> -                                                retaddr);
> -        return;
> -    }
> -
> -    /* Handle aligned access or unaligned access in the same page.  */
> -    if ((addr & (DATA_SIZE - 1)) != 0
> -        && (get_memop(oi) & MO_AMASK) == MO_ALIGN) {
> -        cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
> -                             mmu_idx, retaddr);
> -    }
> -
> -    haddr = addr + env->tlb_table[mmu_idx][index].addend;
> -    glue(glue(st, SUFFIX), _be_p)((uint8_t *)haddr, val);
> +    glue(helper_be_st_name, _do_ram_access)(env, val, addr, oi, mmu_idx, index,
> +                                            retaddr);
>  }
>  #endif /* DATA_SIZE > 1 */


--
Alex Bennée


* Re: [Qemu-devel] [RFC v6 12/14] softmmu: Simplify helper_*_st_name, wrap RAM code
  2015-12-17 16:52   ` Alex Bennée
@ 2015-12-17 17:13     ` alvise rigo
  2015-12-17 20:20       ` Alex Bennée
  0 siblings, 1 reply; 60+ messages in thread
From: alvise rigo @ 2015-12-17 17:13 UTC (permalink / raw)
  To: Alex Bennée
  Cc: mttcg, Claudio Fontana, QEMU Developers, Paolo Bonzini,
	Jani Kokkonen, VirtualOpenSystems Technical Team,
	Richard Henderson

On Thu, Dec 17, 2015 at 5:52 PM, Alex Bennée <alex.bennee@linaro.org> wrote:
>
> Alvise Rigo <a.rigo@virtualopensystems.com> writes:
>
>> Attempting to simplify the helper_*_st_name, wrap the code relative to a
>> RAM access into an inline function.
>
> This commit breaks a default x86_64-softmmu build:

I see. Would these three commits make more sense if squashed together?
Or better to leave them distinct and fix the compilation issue?

BTW, I will now start using that script.

Thank you,
alvise

>
>   CC    x86_64-softmmu/../hw/audio/pcspk.o
> In file included from /home/alex/lsrc/qemu/qemu.git/cputlb.c:527:0:
> /home/alex/lsrc/qemu/qemu.git/softmmu_template.h: In function ‘helper_ret_stb_mmu’:
> /home/alex/lsrc/qemu/qemu.git/softmmu_template.h:503:13: error: ‘haddr’ undeclared (first use in this function)
>              haddr = addr + env->tlb_table[mmu_idx][index].addend;
>              ^
> /home/alex/lsrc/qemu/qemu.git/softmmu_template.h:503:13: note: each undeclared identifier is reported only once for each function it appears in
> In file included from /home/alex/lsrc/qemu/qemu.git/cputlb.c:530:0:
> /home/alex/lsrc/qemu/qemu.git/softmmu_template.h: In function ‘helper_le_stw_mmu’:
> /home/alex/lsrc/qemu/qemu.git/softmmu_template.h:503:13: error: ‘haddr’ undeclared (first use in this function)
>              haddr = addr + env->tlb_table[mmu_idx][index].addend;
>              ^
> /home/alex/lsrc/qemu/qemu.git/softmmu_template.h: In function ‘helper_be_stw_mmu’:
> /home/alex/lsrc/qemu/qemu.git/softmmu_template.h:651:13: error: ‘haddr’ undeclared (first use in this function)
>              haddr = addr + env->tlb_table[mmu_idx][index].addend;
>              ^
> In file included from /home/alex/lsrc/qemu/qemu.git/cputlb.c:533:0:
> /home/alex/lsrc/qemu/qemu.git/softmmu_template.h: In function ‘helper_le_stl_mmu’:
> /home/alex/lsrc/qemu/qemu.git/softmmu_template.h:503:13: error: ‘haddr’ undeclared (first use in this function)
>              haddr = addr + env->tlb_table[mmu_idx][index].addend;
>              ^
> /home/alex/lsrc/qemu/qemu.git/softmmu_template.h: In function ‘helper_be_stl_mmu’:
> /home/alex/lsrc/qemu/qemu.git/softmmu_template.h:651:13: error: ‘haddr’ undeclared (first use in this function)
>              haddr = addr + env->tlb_table[mmu_idx][index].addend;
>              ^
> In file included from /home/alex/lsrc/qemu/qemu.git/cputlb.c:536:0:
> /home/alex/lsrc/qemu/qemu.git/softmmu_template.h: In function ‘helper_le_stq_mmu’:
> /home/alex/lsrc/qemu/qemu.git/softmmu_template.h:503:13: error: ‘haddr’ undeclared (first use in this function)
>              haddr = addr + env->tlb_table[mmu_idx][index].addend;
>              ^
> /home/alex/lsrc/qemu/qemu.git/softmmu_template.h: In function ‘helper_be_stq_mmu’:
> /home/alex/lsrc/qemu/qemu.git/softmmu_template.h:651:13: error: ‘haddr’ undeclared (first use in this function)
>              haddr = addr + env->tlb_table[mmu_idx][index].addend;
>              ^
> make[1]: *** [cputlb.o] Error 1
> make[1]: *** Waiting for unfinished jobs....
>   CC    x86_64-softmmu/../hw/block/fdc.o
> make: *** [subdir-x86_64-softmmu] Error 2
>
> ERROR: commit 3a371deaf11ce944127a00eadbc7e811b6798de1 failed to build!
> commit 3a371deaf11ce944127a00eadbc7e811b6798de1
> Author: Alvise Rigo <a.rigo@virtualopensystems.com>
> Date:   Thu Dec 10 17:26:54 2015 +0100
>
>     softmmu: Simplify helper_*_st_name, wrap RAM code
>
> Found while checking with Jeff's compile-check script:
>
> https://github.com/codyprime/git-scripts
>
> git compile-check -r c3626ca7df027dabf0568284360a23faf18f0884..HEAD
>
>>
>> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
>> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
>> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
>> ---
>>  softmmu_template.h | 110 +++++++++++++++++++++++++++++++++--------------------
>>  1 file changed, 68 insertions(+), 42 deletions(-)
>>
>> diff --git a/softmmu_template.h b/softmmu_template.h
>> index 2ebf527..262c95f 100644
>> --- a/softmmu_template.h
>> +++ b/softmmu_template.h
>> @@ -416,13 +416,46 @@ static inline void glue(helper_le_st_name, _do_mmio_access)(CPUArchState *env,
>>      glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
>>  }
>>
>> +static inline void glue(helper_le_st_name, _do_ram_access)(CPUArchState *env,
>> +                                                           DATA_TYPE val,
>> +                                                           target_ulong addr,
>> +                                                           TCGMemOpIdx oi,
>> +                                                           unsigned mmu_idx,
>> +                                                           int index,
>> +                                                           uintptr_t retaddr)
>> +{
>> +    uintptr_t haddr;
>> +
>> +    /* Handle slow unaligned access (it spans two pages or IO).  */
>> +    if (DATA_SIZE > 1
>> +        && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
>> +                     >= TARGET_PAGE_SIZE)) {
>> +        glue(helper_le_st_name, _do_unl_access)(env, val, addr, oi, mmu_idx,
>> +                                                retaddr);
>> +        return;
>> +    }
>> +
>> +    /* Handle aligned access or unaligned access in the same page.  */
>> +    if ((addr & (DATA_SIZE - 1)) != 0
>> +        && (get_memop(oi) & MO_AMASK) == MO_ALIGN) {
>> +        cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
>> +                             mmu_idx, retaddr);
>> +    }
>> +
>> +    haddr = addr + env->tlb_table[mmu_idx][index].addend;
>> +#if DATA_SIZE == 1
>> +    glue(glue(st, SUFFIX), _p)((uint8_t *)haddr, val);
>> +#else
>> +    glue(glue(st, SUFFIX), _le_p)((uint8_t *)haddr, val);
>> +#endif
>> +}
>> +
>>  void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>>                         TCGMemOpIdx oi, uintptr_t retaddr)
>>  {
>>      unsigned mmu_idx = get_mmuidx(oi);
>>      int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
>>      target_ulong tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
>> -    uintptr_t haddr;
>>
>>      /* Adjust the given return address.  */
>>      retaddr -= GETPC_ADJ;
>> @@ -484,28 +517,8 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>>          }
>>      }
>>
>> -    /* Handle slow unaligned access (it spans two pages or IO).  */
>> -    if (DATA_SIZE > 1
>> -        && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
>> -                     >= TARGET_PAGE_SIZE)) {
>> -        glue(helper_le_st_name, _do_unl_access)(env, val, addr, oi, mmu_idx,
>> -                                                retaddr);
>> -        return;
>> -    }
>> -
>> -    /* Handle aligned access or unaligned access in the same page.  */
>> -    if ((addr & (DATA_SIZE - 1)) != 0
>> -        && (get_memop(oi) & MO_AMASK) == MO_ALIGN) {
>> -        cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
>> -                             mmu_idx, retaddr);
>> -    }
>> -
>> -    haddr = addr + env->tlb_table[mmu_idx][index].addend;
>> -#if DATA_SIZE == 1
>> -    glue(glue(st, SUFFIX), _p)((uint8_t *)haddr, val);
>> -#else
>> -    glue(glue(st, SUFFIX), _le_p)((uint8_t *)haddr, val);
>> -#endif
>> +    glue(helper_le_st_name, _do_ram_access)(env, val, addr, oi, mmu_idx, index,
>> +                                            retaddr);
>>  }
>>
>>  #if DATA_SIZE > 1
>> @@ -555,13 +568,42 @@ static inline void glue(helper_be_st_name, _do_mmio_access)(CPUArchState *env,
>>      glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
>>  }
>>
>> +static inline void glue(helper_be_st_name, _do_ram_access)(CPUArchState *env,
>> +                                                           DATA_TYPE val,
>> +                                                           target_ulong addr,
>> +                                                           TCGMemOpIdx oi,
>> +                                                           unsigned mmu_idx,
>> +                                                           int index,
>> +                                                           uintptr_t retaddr)
>> +{
>> +    uintptr_t haddr;
>> +
>> +    /* Handle slow unaligned access (it spans two pages or IO).  */
>> +    if (DATA_SIZE > 1
>> +        && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
>> +                     >= TARGET_PAGE_SIZE)) {
>> +        glue(helper_be_st_name, _do_unl_access)(env, val, addr, oi, mmu_idx,
>> +                                                retaddr);
>> +        return;
>> +    }
>> +
>> +    /* Handle aligned access or unaligned access in the same page.  */
>> +    if ((addr & (DATA_SIZE - 1)) != 0
>> +        && (get_memop(oi) & MO_AMASK) == MO_ALIGN) {
>> +        cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
>> +                             mmu_idx, retaddr);
>> +    }
>> +
>> +    haddr = addr + env->tlb_table[mmu_idx][index].addend;
>> +    glue(glue(st, SUFFIX), _be_p)((uint8_t *)haddr, val);
>> +}
>> +
>>  void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>>                         TCGMemOpIdx oi, uintptr_t retaddr)
>>  {
>>      unsigned mmu_idx = get_mmuidx(oi);
>>      int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
>>      target_ulong tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
>> -    uintptr_t haddr;
>>
>>      /* Adjust the given return address.  */
>>      retaddr -= GETPC_ADJ;
>> @@ -623,24 +665,8 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>>          }
>>      }
>>
>> -    /* Handle slow unaligned access (it spans two pages or IO).  */
>> -    if (DATA_SIZE > 1
>> -        && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
>> -                     >= TARGET_PAGE_SIZE)) {
>> -        glue(helper_be_st_name, _do_unl_access)(env, val, addr, oi, mmu_idx,
>> -                                                retaddr);
>> -        return;
>> -    }
>> -
>> -    /* Handle aligned access or unaligned access in the same page.  */
>> -    if ((addr & (DATA_SIZE - 1)) != 0
>> -        && (get_memop(oi) & MO_AMASK) == MO_ALIGN) {
>> -        cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
>> -                             mmu_idx, retaddr);
>> -    }
>> -
>> -    haddr = addr + env->tlb_table[mmu_idx][index].addend;
>> -    glue(glue(st, SUFFIX), _be_p)((uint8_t *)haddr, val);
>> +    glue(helper_be_st_name, _do_ram_access)(env, val, addr, oi, mmu_idx, index,
>> +                                            retaddr);
>>  }
>>  #endif /* DATA_SIZE > 1 */
>
>
> --
> Alex Bennée


* Re: [Qemu-devel] [RFC v6 12/14] softmmu: Simplify helper_*_st_name, wrap RAM code
  2015-12-17 17:13     ` alvise rigo
@ 2015-12-17 20:20       ` Alex Bennée
  0 siblings, 0 replies; 60+ messages in thread
From: Alex Bennée @ 2015-12-17 20:20 UTC (permalink / raw)
  To: alvise rigo
  Cc: mttcg, Claudio Fontana, QEMU Developers, Paolo Bonzini,
	Jani Kokkonen, VirtualOpenSystems Technical Team,
	Richard Henderson


alvise rigo <a.rigo@virtualopensystems.com> writes:

> On Thu, Dec 17, 2015 at 5:52 PM, Alex Bennée <alex.bennee@linaro.org> wrote:
>>
>> Alvise Rigo <a.rigo@virtualopensystems.com> writes:
>>
>>> Attempting to simplify the helper_*_st_name, wrap the code relative to a
>>> RAM access into an inline function.
>>
>> This commit breaks a default x86_64-softmmu build:
>
> I see. Would these three commits make more sense if squashed together?
> Or better to leave them distinct and fix the compilation issue?

Keep them distinct if you can, as long as each commit compiles, for bisectability.

>
> BTW, I will now start using that script.
>
> Thank you,
> alvise
>
>>
>>   CC    x86_64-softmmu/../hw/audio/pcspk.o
>> In file included from /home/alex/lsrc/qemu/qemu.git/cputlb.c:527:0:
>> /home/alex/lsrc/qemu/qemu.git/softmmu_template.h: In function ‘helper_ret_stb_mmu’:
>> /home/alex/lsrc/qemu/qemu.git/softmmu_template.h:503:13: error: ‘haddr’ undeclared (first use in this function)
>>              haddr = addr + env->tlb_table[mmu_idx][index].addend;
>>              ^
>> /home/alex/lsrc/qemu/qemu.git/softmmu_template.h:503:13: note: each undeclared identifier is reported only once for each function it appears in
>> In file included from /home/alex/lsrc/qemu/qemu.git/cputlb.c:530:0:
>> /home/alex/lsrc/qemu/qemu.git/softmmu_template.h: In function ‘helper_le_stw_mmu’:
>> /home/alex/lsrc/qemu/qemu.git/softmmu_template.h:503:13: error: ‘haddr’ undeclared (first use in this function)
>>              haddr = addr + env->tlb_table[mmu_idx][index].addend;
>>              ^
>> /home/alex/lsrc/qemu/qemu.git/softmmu_template.h: In function ‘helper_be_stw_mmu’:
>> /home/alex/lsrc/qemu/qemu.git/softmmu_template.h:651:13: error: ‘haddr’ undeclared (first use in this function)
>>              haddr = addr + env->tlb_table[mmu_idx][index].addend;
>>              ^
>> In file included from /home/alex/lsrc/qemu/qemu.git/cputlb.c:533:0:
>> /home/alex/lsrc/qemu/qemu.git/softmmu_template.h: In function ‘helper_le_stl_mmu’:
>> /home/alex/lsrc/qemu/qemu.git/softmmu_template.h:503:13: error: ‘haddr’ undeclared (first use in this function)
>>              haddr = addr + env->tlb_table[mmu_idx][index].addend;
>>              ^
>> /home/alex/lsrc/qemu/qemu.git/softmmu_template.h: In function ‘helper_be_stl_mmu’:
>> /home/alex/lsrc/qemu/qemu.git/softmmu_template.h:651:13: error: ‘haddr’ undeclared (first use in this function)
>>              haddr = addr + env->tlb_table[mmu_idx][index].addend;
>>              ^
>> In file included from /home/alex/lsrc/qemu/qemu.git/cputlb.c:536:0:
>> /home/alex/lsrc/qemu/qemu.git/softmmu_template.h: In function ‘helper_le_stq_mmu’:
>> /home/alex/lsrc/qemu/qemu.git/softmmu_template.h:503:13: error: ‘haddr’ undeclared (first use in this function)
>>              haddr = addr + env->tlb_table[mmu_idx][index].addend;
>>              ^
>> /home/alex/lsrc/qemu/qemu.git/softmmu_template.h: In function ‘helper_be_stq_mmu’:
>> /home/alex/lsrc/qemu/qemu.git/softmmu_template.h:651:13: error: ‘haddr’ undeclared (first use in this function)
>>              haddr = addr + env->tlb_table[mmu_idx][index].addend;
>>              ^
>> make[1]: *** [cputlb.o] Error 1
>> make[1]: *** Waiting for unfinished jobs....
>>   CC    x86_64-softmmu/../hw/block/fdc.o
>> make: *** [subdir-x86_64-softmmu] Error 2
>>
>> ERROR: commit 3a371deaf11ce944127a00eadbc7e811b6798de1 failed to build!
>> commit 3a371deaf11ce944127a00eadbc7e811b6798de1
>> Author: Alvise Rigo <a.rigo@virtualopensystems.com>
>> Date:   Thu Dec 10 17:26:54 2015 +0100
>>
>>     softmmu: Simplify helper_*_st_name, wrap RAM code
>>
>> Found while checking with Jeff's compile-check script:
>>
>> https://github.com/codyprime/git-scripts
>>
>> git compile-check -r c3626ca7df027dabf0568284360a23faf18f0884..HEAD
>>
>>>
>>> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
>>> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
>>> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
>>> ---
>>>  softmmu_template.h | 110 +++++++++++++++++++++++++++++++++--------------------
>>>  1 file changed, 68 insertions(+), 42 deletions(-)
>>>
>>> diff --git a/softmmu_template.h b/softmmu_template.h
>>> index 2ebf527..262c95f 100644
>>> --- a/softmmu_template.h
>>> +++ b/softmmu_template.h
>>> @@ -416,13 +416,46 @@ static inline void glue(helper_le_st_name, _do_mmio_access)(CPUArchState *env,
>>>      glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
>>>  }
>>>
>>> +static inline void glue(helper_le_st_name, _do_ram_access)(CPUArchState *env,
>>> +                                                           DATA_TYPE val,
>>> +                                                           target_ulong addr,
>>> +                                                           TCGMemOpIdx oi,
>>> +                                                           unsigned mmu_idx,
>>> +                                                           int index,
>>> +                                                           uintptr_t retaddr)
>>> +{
>>> +    uintptr_t haddr;
>>> +
>>> +    /* Handle slow unaligned access (it spans two pages or IO).  */
>>> +    if (DATA_SIZE > 1
>>> +        && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
>>> +                     >= TARGET_PAGE_SIZE)) {
>>> +        glue(helper_le_st_name, _do_unl_access)(env, val, addr, oi, mmu_idx,
>>> +                                                retaddr);
>>> +        return;
>>> +    }
>>> +
>>> +    /* Handle aligned access or unaligned access in the same page.  */
>>> +    if ((addr & (DATA_SIZE - 1)) != 0
>>> +        && (get_memop(oi) & MO_AMASK) == MO_ALIGN) {
>>> +        cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
>>> +                             mmu_idx, retaddr);
>>> +    }
>>> +
>>> +    haddr = addr + env->tlb_table[mmu_idx][index].addend;
>>> +#if DATA_SIZE == 1
>>> +    glue(glue(st, SUFFIX), _p)((uint8_t *)haddr, val);
>>> +#else
>>> +    glue(glue(st, SUFFIX), _le_p)((uint8_t *)haddr, val);
>>> +#endif
>>> +}
>>> +
>>>  void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>>>                         TCGMemOpIdx oi, uintptr_t retaddr)
>>>  {
>>>      unsigned mmu_idx = get_mmuidx(oi);
>>>      int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
>>>      target_ulong tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
>>> -    uintptr_t haddr;
>>>
>>>      /* Adjust the given return address.  */
>>>      retaddr -= GETPC_ADJ;
>>> @@ -484,28 +517,8 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>>>          }
>>>      }
>>>
>>> -    /* Handle slow unaligned access (it spans two pages or IO).  */
>>> -    if (DATA_SIZE > 1
>>> -        && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
>>> -                     >= TARGET_PAGE_SIZE)) {
>>> -        glue(helper_le_st_name, _do_unl_access)(env, val, addr, oi, mmu_idx,
>>> -                                                retaddr);
>>> -        return;
>>> -    }
>>> -
>>> -    /* Handle aligned access or unaligned access in the same page.  */
>>> -    if ((addr & (DATA_SIZE - 1)) != 0
>>> -        && (get_memop(oi) & MO_AMASK) == MO_ALIGN) {
>>> -        cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
>>> -                             mmu_idx, retaddr);
>>> -    }
>>> -
>>> -    haddr = addr + env->tlb_table[mmu_idx][index].addend;
>>> -#if DATA_SIZE == 1
>>> -    glue(glue(st, SUFFIX), _p)((uint8_t *)haddr, val);
>>> -#else
>>> -    glue(glue(st, SUFFIX), _le_p)((uint8_t *)haddr, val);
>>> -#endif
>>> +    glue(helper_le_st_name, _do_ram_access)(env, val, addr, oi, mmu_idx, index,
>>> +                                            retaddr);
>>>  }
>>>
>>>  #if DATA_SIZE > 1
>>> @@ -555,13 +568,42 @@ static inline void glue(helper_be_st_name, _do_mmio_access)(CPUArchState *env,
>>>      glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
>>>  }
>>>
>>> +static inline void glue(helper_be_st_name, _do_ram_access)(CPUArchState *env,
>>> +                                                           DATA_TYPE val,
>>> +                                                           target_ulong addr,
>>> +                                                           TCGMemOpIdx oi,
>>> +                                                           unsigned mmu_idx,
>>> +                                                           int index,
>>> +                                                           uintptr_t retaddr)
>>> +{
>>> +    uintptr_t haddr;
>>> +
>>> +    /* Handle slow unaligned access (it spans two pages or IO).  */
>>> +    if (DATA_SIZE > 1
>>> +        && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
>>> +                     >= TARGET_PAGE_SIZE)) {
>>> +        glue(helper_be_st_name, _do_unl_access)(env, val, addr, oi, mmu_idx,
>>> +                                                retaddr);
>>> +        return;
>>> +    }
>>> +
>>> +    /* Handle aligned access or unaligned access in the same page.  */
>>> +    if ((addr & (DATA_SIZE - 1)) != 0
>>> +        && (get_memop(oi) & MO_AMASK) == MO_ALIGN) {
>>> +        cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
>>> +                             mmu_idx, retaddr);
>>> +    }
>>> +
>>> +    haddr = addr + env->tlb_table[mmu_idx][index].addend;
>>> +    glue(glue(st, SUFFIX), _be_p)((uint8_t *)haddr, val);
>>> +}
>>> +
>>>  void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>>>                         TCGMemOpIdx oi, uintptr_t retaddr)
>>>  {
>>>      unsigned mmu_idx = get_mmuidx(oi);
>>>      int index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
>>>      target_ulong tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
>>> -    uintptr_t haddr;
>>>
>>>      /* Adjust the given return address.  */
>>>      retaddr -= GETPC_ADJ;
>>> @@ -623,24 +665,8 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>>>          }
>>>      }
>>>
>>> -    /* Handle slow unaligned access (it spans two pages or IO).  */
>>> -    if (DATA_SIZE > 1
>>> -        && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
>>> -                     >= TARGET_PAGE_SIZE)) {
>>> -        glue(helper_be_st_name, _do_unl_access)(env, val, addr, oi, mmu_idx,
>>> -                                                retaddr);
>>> -        return;
>>> -    }
>>> -
>>> -    /* Handle aligned access or unaligned access in the same page.  */
>>> -    if ((addr & (DATA_SIZE - 1)) != 0
>>> -        && (get_memop(oi) & MO_AMASK) == MO_ALIGN) {
>>> -        cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
>>> -                             mmu_idx, retaddr);
>>> -    }
>>> -
>>> -    haddr = addr + env->tlb_table[mmu_idx][index].addend;
>>> -    glue(glue(st, SUFFIX), _be_p)((uint8_t *)haddr, val);
>>> +    glue(helper_be_st_name, _do_ram_access)(env, val, addr, oi, mmu_idx, index,
>>> +                                            retaddr);
>>>  }
>>>  #endif /* DATA_SIZE > 1 */
>>
>>
>> --
>> Alex Bennée


--
Alex Bennée

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [RFC v6 01/14] exec.c: Add new exclusive bitmap to ram_list
  2015-12-14  8:41 ` [Qemu-devel] [RFC v6 01/14] exec.c: Add new exclusive bitmap to ram_list Alvise Rigo
@ 2015-12-18 13:18   ` Alex Bennée
  2015-12-18 13:47     ` alvise rigo
  0 siblings, 1 reply; 60+ messages in thread
From: Alex Bennée @ 2015-12-18 13:18 UTC (permalink / raw)
  To: Alvise Rigo
  Cc: mttcg, claudio.fontana, qemu-devel, pbonzini, jani.kokkonen, tech, rth


Alvise Rigo <a.rigo@virtualopensystems.com> writes:

> The purpose of this new bitmap is to flag the memory pages that are in
> the middle of LL/SC operations (after a LL, before a SC) on a per-vCPU
> basis.
> For all these pages, the corresponding TLB entries will be generated
> in such a way to force the slow-path if at least one vCPU has the bit
> not set.
> When the system starts, the whole memory is dirty (all the bitmap is
> set). A page, after being marked as exclusively-clean, will be
> restored as dirty after the SC.
>
> For each page we keep 8 bits to be shared among all the vCPUs available
> in the system. In general, vCPU n corresponds to bit n % 8.
>
> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
> ---
>  exec.c                  |  8 ++++--
>  include/exec/memory.h   |  3 +-
>  include/exec/ram_addr.h | 76 +++++++++++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 84 insertions(+), 3 deletions(-)
>
> diff --git a/exec.c b/exec.c
> index 0bf0a6e..e66d232 100644
> --- a/exec.c
> +++ b/exec.c
> @@ -1548,11 +1548,15 @@ static ram_addr_t ram_block_add(RAMBlock *new_block, Error **errp)
>          int i;
>
>          /* ram_list.dirty_memory[] is protected by the iothread lock.  */
> -        for (i = 0; i < DIRTY_MEMORY_NUM; i++) {
> +        for (i = 0; i < DIRTY_MEMORY_EXCLUSIVE; i++) {
>              ram_list.dirty_memory[i] =
>                  bitmap_zero_extend(ram_list.dirty_memory[i],
>                                     old_ram_size, new_ram_size);
> -       }
> +        }
> +        ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE] = bitmap_zero_extend(
> +                ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE],
> +                old_ram_size * EXCL_BITMAP_CELL_SZ,
> +                new_ram_size * EXCL_BITMAP_CELL_SZ);
>      }

I'm wondering if old/new_ram_size should be renamed to
old/new_ram_pages?

So as I understand it we have created a bitmap which has
EXCL_BITMAP_CELL_SZ bits per page.

>      cpu_physical_memory_set_dirty_range(new_block->offset,
>                                          new_block->used_length,
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index 0f07159..2782c77 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -19,7 +19,8 @@
>  #define DIRTY_MEMORY_VGA       0
>  #define DIRTY_MEMORY_CODE      1
>  #define DIRTY_MEMORY_MIGRATION 2
> -#define DIRTY_MEMORY_NUM       3        /* num of dirty bits */
> +#define DIRTY_MEMORY_EXCLUSIVE 3
> +#define DIRTY_MEMORY_NUM       4        /* num of dirty bits */
>
>  #include <stdint.h>
>  #include <stdbool.h>
> diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
> index 7115154..b48af27 100644
> --- a/include/exec/ram_addr.h
> +++ b/include/exec/ram_addr.h
> @@ -21,6 +21,7 @@
>
>  #ifndef CONFIG_USER_ONLY
>  #include "hw/xen/xen.h"
> +#include "sysemu/sysemu.h"
>
>  struct RAMBlock {
>      struct rcu_head rcu;
> @@ -82,6 +83,13 @@ int qemu_ram_resize(ram_addr_t base, ram_addr_t newsize, Error **errp);
>  #define DIRTY_CLIENTS_ALL     ((1 << DIRTY_MEMORY_NUM) - 1)
>  #define DIRTY_CLIENTS_NOCODE  (DIRTY_CLIENTS_ALL & ~(1 << DIRTY_MEMORY_CODE))
>
> +/* Exclusive bitmap support. */
> +#define EXCL_BITMAP_CELL_SZ 8
> +#define EXCL_BITMAP_GET_BIT_OFFSET(addr) \
> +        (EXCL_BITMAP_CELL_SZ * (addr >> TARGET_PAGE_BITS))
> +#define EXCL_BITMAP_GET_BYTE_OFFSET(addr) (addr >> TARGET_PAGE_BITS)
> +#define EXCL_IDX(cpu) (cpu % EXCL_BITMAP_CELL_SZ)
> +

I think some of the explanation of what CELL_SZ means from your commit
message needs to go here.

>  static inline bool cpu_physical_memory_get_dirty(ram_addr_t start,
>                                                   ram_addr_t length,
>                                                   unsigned client)
> @@ -173,6 +181,11 @@ static inline void cpu_physical_memory_set_dirty_range(ram_addr_t start,
>      if (unlikely(mask & (1 << DIRTY_MEMORY_CODE))) {
>          bitmap_set_atomic(d[DIRTY_MEMORY_CODE], page, end - page);
>      }
> +    if (unlikely(mask & (1 << DIRTY_MEMORY_EXCLUSIVE))) {
> +        bitmap_set_atomic(d[DIRTY_MEMORY_EXCLUSIVE],
> +                        page * EXCL_BITMAP_CELL_SZ,
> +                        (end - page) * EXCL_BITMAP_CELL_SZ);
> +    }
>      xen_modified_memory(start, length);
>  }
>
> @@ -288,5 +301,68 @@ uint64_t cpu_physical_memory_sync_dirty_bitmap(unsigned long *dest,
>  }
>
>  void migration_bitmap_extend(ram_addr_t old, ram_addr_t new);
> +
> +/* One cell for each page. The n-th bit of a cell describes all the i-th vCPUs
> + * such that (i % EXCL_BITMAP_CELL_SZ) == n.
> + * A bit set to zero ensures that all the vCPUs described by the bit have the
> + * EXCL_BIT set for the page. */
> +static inline void cpu_physical_memory_unset_excl(ram_addr_t addr, uint32_t cpu)
> +{
> +    set_bit_atomic(EXCL_BITMAP_GET_BIT_OFFSET(addr) + EXCL_IDX(cpu),
> +            ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE]);
> +}
> +
> +/* Return true if there is at least one cpu with the EXCL bit set for the page
> + * of @addr. */
> +static inline int cpu_physical_memory_atleast_one_excl(ram_addr_t addr)
> +{
> +    uint8_t *bitmap;
> +
> +    bitmap = (uint8_t *)(ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE]);
> +
> +    /* This is safe even if smp_cpus < 8 since the unused bits are always 1. */
> +    return bitmap[EXCL_BITMAP_GET_BYTE_OFFSET(addr)] != UCHAR_MAX;
> +}

So I'm getting a little confused as to why we need EXCL_BITMAP_CELL_SZ
bits rather than just one bit shared amongst the CPUs. What's so special
about 8 if we could, say, have a 16-CPU system?

Ultimately if any page has an exclusive operation going on all vCPUs are
forced down the slow-path for that pages access? Where is the benefit in
splitting that bitfield across these vCPUs if the only test is "is at
least one vCPU doing an excl right now?".

> +
> +/* Return true if the @cpu has the bit set (not exclusive) for the page of
> + * @addr.  If @cpu == smp_cpus return true if at least one vCPU has the dirty
> + * bit set for that page. */
> +static inline int cpu_physical_memory_not_excl(ram_addr_t addr,
> +                                               unsigned long cpu)
> +{
> +    uint8_t *bitmap;
> +
> +    bitmap = (uint8_t *)ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE];
> +
> +    if (cpu == smp_cpus) {
> +        if (smp_cpus >= EXCL_BITMAP_CELL_SZ) {
> +            return bitmap[EXCL_BITMAP_GET_BYTE_OFFSET(addr)];
> +        } else {
> +            return bitmap[EXCL_BITMAP_GET_BYTE_OFFSET(addr)] &
> +                                            ((1 << smp_cpus) - 1);
> +        }
> +    } else {
> +        return bitmap[EXCL_BITMAP_GET_BYTE_OFFSET(addr)] & (1 << EXCL_IDX(cpu));
> +    }
> +}
> +
> +/* Clean the dirty bit of @cpu (i.e. set the page as exclusive). If @cpu ==
> + * smp_cpus clean the dirty bit for all the vCPUs. */
> +static inline int cpu_physical_memory_set_excl(ram_addr_t addr, uint32_t cpu)
> +{
> +    if (cpu == smp_cpus) {
> +        int nr = (smp_cpus >= EXCL_BITMAP_CELL_SZ) ?
> +                            EXCL_BITMAP_CELL_SZ : smp_cpus;
> +
> +        return bitmap_test_and_clear_atomic(
> +                        ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE],
> +                        EXCL_BITMAP_GET_BIT_OFFSET(addr), nr);
> +    } else {
> +        return bitmap_test_and_clear_atomic(
> +                        ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE],
> +                        EXCL_BITMAP_GET_BIT_OFFSET(addr) + EXCL_IDX(cpu), 1);
> +    }
> +}
> +
>  #endif
>  #endif

Maybe this will get clearer as I read on but at the moment the bitfield
split confuses me.

--
Alex Bennée


* Re: [Qemu-devel] [RFC v6 01/14] exec.c: Add new exclusive bitmap to ram_list
  2015-12-18 13:18   ` Alex Bennée
@ 2015-12-18 13:47     ` alvise rigo
  0 siblings, 0 replies; 60+ messages in thread
From: alvise rigo @ 2015-12-18 13:47 UTC (permalink / raw)
  To: Alex Bennée
  Cc: mttcg, Claudio Fontana, QEMU Developers, Paolo Bonzini,
	Jani Kokkonen, VirtualOpenSystems Technical Team,
	Richard Henderson

On Fri, Dec 18, 2015 at 2:18 PM, Alex Bennée <alex.bennee@linaro.org> wrote:
>
> Alvise Rigo <a.rigo@virtualopensystems.com> writes:
>
>> The purpose of this new bitmap is to flag the memory pages that are in
>> the middle of LL/SC operations (after a LL, before a SC) on a per-vCPU
>> basis.
>> For all these pages, the corresponding TLB entries will be generated
>> in such a way to force the slow-path if at least one vCPU has the bit
>> not set.
>> When the system starts, the whole memory is dirty (all the bitmap is
>> set). A page, after being marked as exclusively-clean, will be
>> restored as dirty after the SC.
>>
>> For each page we keep 8 bits to be shared among all the vCPUs available
>> in the system. In general, vCPU n corresponds to bit n % 8.
>>
>> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
>> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
>> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
>> ---
>>  exec.c                  |  8 ++++--
>>  include/exec/memory.h   |  3 +-
>>  include/exec/ram_addr.h | 76 +++++++++++++++++++++++++++++++++++++++++++++++++
>>  3 files changed, 84 insertions(+), 3 deletions(-)
>>
>> diff --git a/exec.c b/exec.c
>> index 0bf0a6e..e66d232 100644
>> --- a/exec.c
>> +++ b/exec.c
>> @@ -1548,11 +1548,15 @@ static ram_addr_t ram_block_add(RAMBlock *new_block, Error **errp)
>>          int i;
>>
>>          /* ram_list.dirty_memory[] is protected by the iothread lock.  */
>> -        for (i = 0; i < DIRTY_MEMORY_NUM; i++) {
>> +        for (i = 0; i < DIRTY_MEMORY_EXCLUSIVE; i++) {
>>              ram_list.dirty_memory[i] =
>>                  bitmap_zero_extend(ram_list.dirty_memory[i],
>>                                     old_ram_size, new_ram_size);
>> -       }
>> +        }
>> +        ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE] = bitmap_zero_extend(
>> +                ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE],
>> +                old_ram_size * EXCL_BITMAP_CELL_SZ,
>> +                new_ram_size * EXCL_BITMAP_CELL_SZ);
>>      }
>
> I'm wondering if old/new_ram_size should be renamed to
> old/new_ram_pages?

Yes, I think it makes more sense.

>
> So as I understand it we have created a bitmap which has
> EXCL_BITMAP_CELL_SZ bits per page.
>
>>      cpu_physical_memory_set_dirty_range(new_block->offset,
>>                                          new_block->used_length,
>> diff --git a/include/exec/memory.h b/include/exec/memory.h
>> index 0f07159..2782c77 100644
>> --- a/include/exec/memory.h
>> +++ b/include/exec/memory.h
>> @@ -19,7 +19,8 @@
>>  #define DIRTY_MEMORY_VGA       0
>>  #define DIRTY_MEMORY_CODE      1
>>  #define DIRTY_MEMORY_MIGRATION 2
>> -#define DIRTY_MEMORY_NUM       3        /* num of dirty bits */
>> +#define DIRTY_MEMORY_EXCLUSIVE 3
>> +#define DIRTY_MEMORY_NUM       4        /* num of dirty bits */
>>
>>  #include <stdint.h>
>>  #include <stdbool.h>
>> diff --git a/include/exec/ram_addr.h b/include/exec/ram_addr.h
>> index 7115154..b48af27 100644
>> --- a/include/exec/ram_addr.h
>> +++ b/include/exec/ram_addr.h
>> @@ -21,6 +21,7 @@
>>
>>  #ifndef CONFIG_USER_ONLY
>>  #include "hw/xen/xen.h"
>> +#include "sysemu/sysemu.h"
>>
>>  struct RAMBlock {
>>      struct rcu_head rcu;
>> @@ -82,6 +83,13 @@ int qemu_ram_resize(ram_addr_t base, ram_addr_t newsize, Error **errp);
>>  #define DIRTY_CLIENTS_ALL     ((1 << DIRTY_MEMORY_NUM) - 1)
>>  #define DIRTY_CLIENTS_NOCODE  (DIRTY_CLIENTS_ALL & ~(1 << DIRTY_MEMORY_CODE))
>>
>> +/* Exclusive bitmap support. */
>> +#define EXCL_BITMAP_CELL_SZ 8
>> +#define EXCL_BITMAP_GET_BIT_OFFSET(addr) \
>> +        (EXCL_BITMAP_CELL_SZ * (addr >> TARGET_PAGE_BITS))
>> +#define EXCL_BITMAP_GET_BYTE_OFFSET(addr) (addr >> TARGET_PAGE_BITS)
>> +#define EXCL_IDX(cpu) (cpu % EXCL_BITMAP_CELL_SZ)
>> +
>
> I think some of the explanation of what CELL_SZ means from your commit
> message needs to go here.

OK.

>
>>  static inline bool cpu_physical_memory_get_dirty(ram_addr_t start,
>>                                                   ram_addr_t length,
>>                                                   unsigned client)
>> @@ -173,6 +181,11 @@ static inline void cpu_physical_memory_set_dirty_range(ram_addr_t start,
>>      if (unlikely(mask & (1 << DIRTY_MEMORY_CODE))) {
>>          bitmap_set_atomic(d[DIRTY_MEMORY_CODE], page, end - page);
>>      }
>> +    if (unlikely(mask & (1 << DIRTY_MEMORY_EXCLUSIVE))) {
>> +        bitmap_set_atomic(d[DIRTY_MEMORY_EXCLUSIVE],
>> +                        page * EXCL_BITMAP_CELL_SZ,
>> +                        (end - page) * EXCL_BITMAP_CELL_SZ);
>> +    }
>>      xen_modified_memory(start, length);
>>  }
>>
>> @@ -288,5 +301,68 @@ uint64_t cpu_physical_memory_sync_dirty_bitmap(unsigned long *dest,
>>  }
>>
>>  void migration_bitmap_extend(ram_addr_t old, ram_addr_t new);
>> +
>> +/* One cell for each page. The n-th bit of a cell describes all the i-th vCPUs
>> + * such that (i % EXCL_BITMAP_CELL_SZ) == n.
>> + * A bit set to zero ensures that all the vCPUs described by the bit have the
>> + * EXCL_BIT set for the page. */
>> +static inline void cpu_physical_memory_unset_excl(ram_addr_t addr, uint32_t cpu)
>> +{
>> +    set_bit_atomic(EXCL_BITMAP_GET_BIT_OFFSET(addr) + EXCL_IDX(cpu),
>> +            ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE]);
>> +}
>> +
>> +/* Return true if there is at least one cpu with the EXCL bit set for the page
>> + * of @addr. */
>> +static inline int cpu_physical_memory_atleast_one_excl(ram_addr_t addr)
>> +{
>> +    uint8_t *bitmap;
>> +
>> +    bitmap = (uint8_t *)(ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE]);
>> +
>> +    /* This is safe even if smp_cpus < 8 since the unused bits are always 1. */
>> +    return bitmap[EXCL_BITMAP_GET_BYTE_OFFSET(addr)] != UCHAR_MAX;
>> +}
>
> So I'm getting a little confused as to why we need EXCL_BITMAP_CELL_SZ
> bits rather than just one bit shared amongst the CPUs. What's so special
> about 8 if we could, say, have a 16-CPU system?

Here the ideal solution would be to have one bit per CPU, but since
this is too expensive, we group them in clusters of 8.
Of course the value 8 could be changed, but I would not go much higher
since DIRTY_MEMORY_EXCLUSIVE is already 8 times the size of the other
bitmaps.

>
> Ultimately if any page has an exclusive operation going on all vCPUs are
> forced down the slow-path for that pages access? Where is the benefit in
> splitting that bitfield across these vCPUs if the only test is "is at
> least one vCPU doing an excl right now?".

Clustering the CPUs in this way, we make the flushing procedure a bit
more efficient, since we perform TLB flushes at cluster granularity.
In practice, when handling a LL operation we request the flush only
from those clusters with the dirty bit set.

Regards,
alvise

>
>> +
>> +/* Return true if the @cpu has the bit set (not exclusive) for the page of
>> + * @addr.  If @cpu == smp_cpus return true if at least one vCPU has the dirty
>> + * bit set for that page. */
>> +static inline int cpu_physical_memory_not_excl(ram_addr_t addr,
>> +                                               unsigned long cpu)
>> +{
>> +    uint8_t *bitmap;
>> +
>> +    bitmap = (uint8_t *)ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE];
>> +
>> +    if (cpu == smp_cpus) {
>> +        if (smp_cpus >= EXCL_BITMAP_CELL_SZ) {
>> +            return bitmap[EXCL_BITMAP_GET_BYTE_OFFSET(addr)];
>> +        } else {
>> +            return bitmap[EXCL_BITMAP_GET_BYTE_OFFSET(addr)] &
>> +                                            ((1 << smp_cpus) - 1);
>> +        }
>> +    } else {
>> +        return bitmap[EXCL_BITMAP_GET_BYTE_OFFSET(addr)] & (1 << EXCL_IDX(cpu));
>> +    }
>> +}
>> +
>> +/* Clean the dirty bit of @cpu (i.e. set the page as exclusive). If @cpu ==
>> + * smp_cpus clean the dirty bit for all the vCPUs. */
>> +static inline int cpu_physical_memory_set_excl(ram_addr_t addr, uint32_t cpu)
>> +{
>> +    if (cpu == smp_cpus) {
>> +        int nr = (smp_cpus >= EXCL_BITMAP_CELL_SZ) ?
>> +                            EXCL_BITMAP_CELL_SZ : smp_cpus;
>> +
>> +        return bitmap_test_and_clear_atomic(
>> +                        ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE],
>> +                        EXCL_BITMAP_GET_BIT_OFFSET(addr), nr);
>> +    } else {
>> +        return bitmap_test_and_clear_atomic(
>> +                        ram_list.dirty_memory[DIRTY_MEMORY_EXCLUSIVE],
>> +                        EXCL_BITMAP_GET_BIT_OFFSET(addr) + EXCL_IDX(cpu), 1);
>> +    }
>> +}
>> +
>>  #endif
>>  #endif
>
> Maybe this will get clearer as I read on but at the moment the bitfield
> split confuses me.
>
> --
> Alex Bennée


* Re: [Qemu-devel] [RFC v6 02/14] softmmu: Add new TLB_EXCL flag
  2015-12-14  8:41 ` [Qemu-devel] [RFC v6 02/14] softmmu: Add new TLB_EXCL flag Alvise Rigo
@ 2016-01-05 16:10   ` Alex Bennée
  2016-01-05 17:27     ` alvise rigo
  0 siblings, 1 reply; 60+ messages in thread
From: Alex Bennée @ 2016-01-05 16:10 UTC (permalink / raw)
  To: Alvise Rigo
  Cc: mttcg, claudio.fontana, qemu-devel, pbonzini, jani.kokkonen, tech, rth


Alvise Rigo <a.rigo@virtualopensystems.com> writes:

> Add a new TLB flag to force all the accesses made to a page to follow
> the slow-path.
>
> In the case we remove a TLB entry marked as EXCL, we unset the
> corresponding exclusive bit in the bitmap.
>
> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
> ---
>  cputlb.c                |  38 +++++++++++++++-
>  include/exec/cpu-all.h  |   8 ++++
>  include/exec/cpu-defs.h |   1 +
>  include/qom/cpu.h       |  14 ++++++
>  softmmu_template.h      | 114 ++++++++++++++++++++++++++++++++++++++----------
>  5 files changed, 152 insertions(+), 23 deletions(-)
>
> diff --git a/cputlb.c b/cputlb.c
> index bf1d50a..7ee0c89 100644
> --- a/cputlb.c
> +++ b/cputlb.c
> @@ -394,6 +394,16 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr,
>      env->tlb_v_table[mmu_idx][vidx] = *te;
>      env->iotlb_v[mmu_idx][vidx] = env->iotlb[mmu_idx][index];
>
> +    if (unlikely(!(te->addr_write & TLB_MMIO) && (te->addr_write & TLB_EXCL))) {

Why do we care about TLB_MMIO flags here? Does it actually happen? Would
bad things happen if we enforced exclusivity for an MMIO write? Do the
other flags matter?

There should be a comment as to why MMIO is mentioned I think.

> +        /* We are removing an exclusive entry, set the page to dirty. This
> +         * is not necessary if the vCPU has performed both LL and SC. */
> +        hwaddr hw_addr = (env->iotlb[mmu_idx][index].addr & TARGET_PAGE_MASK) +
> +                                          (te->addr_write & TARGET_PAGE_MASK);
> +        if (!cpu->ll_sc_context) {
> +            cpu_physical_memory_unset_excl(hw_addr, cpu->cpu_index);
> +        }
> +    }
> +
>      /* refill the tlb */
>      env->iotlb[mmu_idx][index].addr = iotlb - vaddr;
>      env->iotlb[mmu_idx][index].attrs = attrs;
> @@ -419,7 +429,15 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr,
>                                                     + xlat)) {
>              te->addr_write = address | TLB_NOTDIRTY;
>          } else {
> -            te->addr_write = address;
> +            if (!(address & TLB_MMIO) &&
> +                cpu_physical_memory_atleast_one_excl(section->mr->ram_addr
> +                                                           + xlat)) {
> +                /* There is at least one vCPU that has flagged the address as
> +                 * exclusive. */
> +                te->addr_write = address | TLB_EXCL;
> +            } else {
> +                te->addr_write = address;
> +            }
>          }
>      } else {
>          te->addr_write = -1;
> @@ -471,6 +489,24 @@ tb_page_addr_t get_page_addr_code(CPUArchState *env1, target_ulong addr)
>      return qemu_ram_addr_from_host_nofail(p);
>  }
>
> +/* For every vCPU compare the exclusive address and reset it in case of a
> + * match. Since only one vCPU is running at once, no lock has to be held to
> + * guard this operation. */
> +static inline void lookup_and_reset_cpus_ll_addr(hwaddr addr, hwaddr size)
> +{
> +    CPUState *cpu;
> +
> +    CPU_FOREACH(cpu) {
> +        if (cpu->excl_protected_range.begin != EXCLUSIVE_RESET_ADDR &&
> +            ranges_overlap(cpu->excl_protected_range.begin,
> +                           cpu->excl_protected_range.end -
> +                           cpu->excl_protected_range.begin,
> +                           addr, size)) {
> +            cpu->excl_protected_range.begin = EXCLUSIVE_RESET_ADDR;
> +        }
> +    }
> +}
> +
>  #define MMUSUFFIX _mmu
>
>  #define SHIFT 0
> diff --git a/include/exec/cpu-all.h b/include/exec/cpu-all.h
> index 83b1781..f8d8feb 100644
> --- a/include/exec/cpu-all.h
> +++ b/include/exec/cpu-all.h
> @@ -277,6 +277,14 @@ CPUArchState *cpu_copy(CPUArchState *env);
>  #define TLB_NOTDIRTY    (1 << 4)
>  /* Set if TLB entry is an IO callback.  */
>  #define TLB_MMIO        (1 << 5)
> +/* Set if TLB entry references a page that requires exclusive access.  */
> +#define TLB_EXCL        (1 << 6)
> +
> +/* Do not allow a TARGET_PAGE_MASK which covers one or more bits defined
> + * above. */
> +#if TLB_EXCL >= TARGET_PAGE_SIZE
> +#error TARGET_PAGE_MASK covering the low bits of the TLB virtual address
> +#endif
>
>  void dump_exec_info(FILE *f, fprintf_function cpu_fprintf);
>  void dump_opcount_info(FILE *f, fprintf_function cpu_fprintf);
> diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
> index 5093be2..b34d7ae 100644
> --- a/include/exec/cpu-defs.h
> +++ b/include/exec/cpu-defs.h
> @@ -27,6 +27,7 @@
>  #include <inttypes.h>
>  #include "qemu/osdep.h"
>  #include "qemu/queue.h"
> +#include "qemu/range.h"
>  #include "tcg-target.h"
>  #ifndef CONFIG_USER_ONLY
>  #include "exec/hwaddr.h"
> diff --git a/include/qom/cpu.h b/include/qom/cpu.h
> index 51a1323..c6bb6b6 100644
> --- a/include/qom/cpu.h
> +++ b/include/qom/cpu.h
> @@ -29,6 +29,7 @@
>  #include "qemu/queue.h"
>  #include "qemu/thread.h"
>  #include "qemu/typedefs.h"
> +#include "qemu/range.h"
>
>  typedef int (*WriteCoreDumpFunction)(const void *buf, size_t size,
>                                       void *opaque);
> @@ -210,6 +211,9 @@ struct kvm_run;
>  #define TB_JMP_CACHE_BITS 12
>  #define TB_JMP_CACHE_SIZE (1 << TB_JMP_CACHE_BITS)
>
> +/* Atomic insn translation TLB support. */
> +#define EXCLUSIVE_RESET_ADDR ULLONG_MAX
> +
>  /**
>   * CPUState:
>   * @cpu_index: CPU index (informative).
> @@ -329,6 +333,16 @@ struct CPUState {
>       */
>      bool throttle_thread_scheduled;
>
> +    /* Used by the atomic insn translation backend. */
> +    int ll_sc_context;
> +    /* vCPU current exclusive addresses range.
> +     * The address is set to EXCLUSIVE_RESET_ADDR if the vCPU is not
> +     * in the middle of a LL/SC. */
> +    struct Range excl_protected_range;
> +    /* Used to carry the SC result but also to flag a normal (legacy)
> +     * store access made by a stcond (see softmmu_template.h). */
> +    int excl_succeeded;

It might be clearer if excl_succeeded was defined as a bool?

>      /* Note that this is accessed at the start of every TB via a negative
>         offset from AREG0.  Leave this field at the end so as to make the
>         (absolute value) offset as small as possible.  This reduces code
> diff --git a/softmmu_template.h b/softmmu_template.h
> index 6803890..24d29b7 100644
> --- a/softmmu_template.h
> +++ b/softmmu_template.h
> @@ -395,19 +395,54 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>          tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
>      }
>
> -    /* Handle an IO access.  */
> +    /* Handle an IO access or exclusive access.  */
>      if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
> -        CPUIOTLBEntry *iotlbentry;
> -        if ((addr & (DATA_SIZE - 1)) != 0) {
> -            goto do_unaligned_access;
> +        CPUIOTLBEntry *iotlbentry = &env->iotlb[mmu_idx][index];
> +
> +        if ((tlb_addr & ~TARGET_PAGE_MASK) == TLB_EXCL) {
> +            CPUState *cpu = ENV_GET_CPU(env);
> +            /* The slow-path has been forced since we are writing to
> +             * exclusive-protected memory. */
> +            hwaddr hw_addr = (iotlbentry->addr & TARGET_PAGE_MASK) + addr;
> +
> +            /* The function lookup_and_reset_cpus_ll_addr could have reset the
> +             * exclusive address. Fail the SC in this case.
> +             * N.B.: Here excl_succeeded == 0 means that helper_le_st_name has
> +             * not been called by a softmmu_llsc_template.h. */

Could this be better worded (along with bool-ising) as:

"excl_succeeded is set by helper_le_st_name (softmmu_llsc_template)."

But having said that grepping for helper_le_st_name I see that's defined
in softmmu_template.h so now the comment has confused me.

It also might be worth mentioning the subtly that exclusive addresses
are based on the real hwaddr (hence the iotlb lookup?).

> +            if (cpu->excl_succeeded) {
> +                if (cpu->excl_protected_range.begin != hw_addr) {
> +                    /* The vCPU is SC-ing to an unprotected address. */
> +                    cpu->excl_protected_range.begin = EXCLUSIVE_RESET_ADDR;
> +                    cpu->excl_succeeded = 0;

cpu->excl_succeeded = false;

> +
> +                    return;
> +                }
> +
> +                cpu_physical_memory_unset_excl(hw_addr, cpu->cpu_index);
> +            }
> +
> +            haddr = addr + env->tlb_table[mmu_idx][index].addend;
> +        #if DATA_SIZE == 1
> +            glue(glue(st, SUFFIX), _p)((uint8_t *)haddr, val);
> +        #else
> +            glue(glue(st, SUFFIX), _le_p)((uint8_t *)haddr, val);
> +        #endif

Why the special casing for byte access? Isn't this something the glue +
SUFFIX magic is meant to sort out?

> +
> +            lookup_and_reset_cpus_ll_addr(hw_addr, DATA_SIZE);
> +
> +            return;
> +        } else {
> +            if ((addr & (DATA_SIZE - 1)) != 0) {
> +                goto do_unaligned_access;
> +            }
> +            iotlbentry = &env->iotlb[mmu_idx][index];

Are we re-loading the TLB entry here?

> +
> +            /* ??? Note that the io helpers always read data in the target
> +               byte ordering.  We should push the LE/BE request down into io.  */
> +            val = TGT_LE(val);
> +            glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);

What happens if the software does an exclusive operation on an io
address?

> +            return;
>          }
> -        iotlbentry = &env->iotlb[mmu_idx][index];
> -
> -        /* ??? Note that the io helpers always read data in the target
> -           byte ordering.  We should push the LE/BE request down into io.  */
> -        val = TGT_LE(val);
> -        glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
> -        return;
>      }
>
>      /* Handle slow unaligned access (it spans two pages or IO).  */
> @@ -475,19 +510,54 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>          tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
>      }
>
> -    /* Handle an IO access.  */
> +    /* Handle an IO access or exclusive access.  */

Hmm there looks like a massive amount of duplication (not your fault, it
was like that when you got here ;-) but maybe this can be re-factored
away somehow?

>      if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
> -        CPUIOTLBEntry *iotlbentry;
> -        if ((addr & (DATA_SIZE - 1)) != 0) {
> -            goto do_unaligned_access;
> +        CPUIOTLBEntry *iotlbentry = &env->iotlb[mmu_idx][index];
> +
> +        if ((tlb_addr & ~TARGET_PAGE_MASK) == TLB_EXCL) {
> +            CPUState *cpu = ENV_GET_CPU(env);
> +            /* The slow-path has been forced since we are writing to
> +             * exclusive-protected memory. */
> +            hwaddr hw_addr = (iotlbentry->addr & TARGET_PAGE_MASK) + addr;
> +
> +            /* The function lookup_and_reset_cpus_ll_addr could have reset the
> +             * exclusive address. Fail the SC in this case.
> +             * N.B.: Here excl_succeeded == 0 means that helper_le_st_name has
> +             * not been called by a softmmu_llsc_template.h. */
> +            if (cpu->excl_succeeded) {
> +                if (cpu->excl_protected_range.begin != hw_addr) {
> +                    /* The vCPU is SC-ing to an unprotected address. */
> +                    cpu->excl_protected_range.begin = EXCLUSIVE_RESET_ADDR;
> +                    cpu->excl_succeeded = 0;
> +
> +                    return;
> +                }
> +
> +                cpu_physical_memory_unset_excl(hw_addr, cpu->cpu_index);
> +            }
> +
> +            haddr = addr + env->tlb_table[mmu_idx][index].addend;
> +        #if DATA_SIZE == 1
> +            glue(glue(st, SUFFIX), _p)((uint8_t *)haddr, val);
> +        #else
> +            glue(glue(st, SUFFIX), _le_p)((uint8_t *)haddr, val);
> +        #endif
> +
> +            lookup_and_reset_cpus_ll_addr(hw_addr, DATA_SIZE);
> +
> +            return;
> +        } else {
> +            if ((addr & (DATA_SIZE - 1)) != 0) {
> +                goto do_unaligned_access;
> +            }
> +            iotlbentry = &env->iotlb[mmu_idx][index];
> +
> +            /* ??? Note that the io helpers always read data in the target
> +               byte ordering.  We should push the LE/BE request down into io.  */
> +            val = TGT_BE(val);
> +            glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
> +            return;
>          }
> -        iotlbentry = &env->iotlb[mmu_idx][index];
> -
> -        /* ??? Note that the io helpers always read data in the target
> -           byte ordering.  We should push the LE/BE request down into io.  */
> -        val = TGT_BE(val);
> -        glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
> -        return;
>      }
>
>      /* Handle slow unaligned access (it spans two pages or IO).  */


--
Alex Bennée


* Re: [Qemu-devel] [RFC v6 03/14] Add CPUClass hook to set exclusive range
  2015-12-14  8:41 ` [Qemu-devel] [RFC v6 03/14] Add CPUClass hook to set exclusive range Alvise Rigo
@ 2016-01-05 16:42   ` Alex Bennée
  0 siblings, 0 replies; 60+ messages in thread
From: Alex Bennée @ 2016-01-05 16:42 UTC (permalink / raw)
  To: Alvise Rigo
  Cc: mttcg, claudio.fontana, qemu-devel, pbonzini, jani.kokkonen, tech, rth


Alvise Rigo <a.rigo@virtualopensystems.com> writes:

> Allow each architecture to set the exclusive range at any LoadLink
> operation through a CPUClass hook.

nit: space or continue paragraph.

> This comes in handy to emulate, for instance, the exclusive monitor
> implemented in some ARM architectures (more precisely, the Exclusive
> Reservation Granule).
>
> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>

Reviewed-by: Alex Bennée <alex.bennee@linaro.org>

> ---
>  include/qom/cpu.h | 4 ++++
>  qom/cpu.c         | 7 +++++++
>  2 files changed, 11 insertions(+)
>
> diff --git a/include/qom/cpu.h b/include/qom/cpu.h
> index c6bb6b6..9e409ce 100644
> --- a/include/qom/cpu.h
> +++ b/include/qom/cpu.h
> @@ -175,6 +175,10 @@ typedef struct CPUClass {
>      void (*cpu_exec_exit)(CPUState *cpu);
>      bool (*cpu_exec_interrupt)(CPUState *cpu, int interrupt_request);
>
> +    /* Atomic instruction handling */
> +    void (*cpu_set_excl_protected_range)(CPUState *cpu, hwaddr addr,
> +                                         hwaddr size);
> +
>      void (*disas_set_info)(CPUState *cpu, disassemble_info *info);
>  } CPUClass;
>
> diff --git a/qom/cpu.c b/qom/cpu.c
> index fb80d13..a5c25a8 100644
> --- a/qom/cpu.c
> +++ b/qom/cpu.c
> @@ -203,6 +203,12 @@ static bool cpu_common_exec_interrupt(CPUState *cpu, int int_req)
>      return false;
>  }
>
> +static void cpu_common_set_excl_range(CPUState *cpu, hwaddr addr, hwaddr size)
> +{
> +    cpu->excl_protected_range.begin = addr;
> +    cpu->excl_protected_range.end = addr + size;
> +}
> +
>  void cpu_dump_state(CPUState *cpu, FILE *f, fprintf_function cpu_fprintf,
>                      int flags)
>  {
> @@ -355,6 +361,7 @@ static void cpu_class_init(ObjectClass *klass, void *data)
>      k->cpu_exec_enter = cpu_common_noop;
>      k->cpu_exec_exit = cpu_common_noop;
>      k->cpu_exec_interrupt = cpu_common_exec_interrupt;
> +    k->cpu_set_excl_protected_range = cpu_common_set_excl_range;
>      dc->realize = cpu_common_realizefn;
>      /*
>       * Reason: CPUs still need special care by board code: wiring up


--
Alex Bennée


* Re: [Qemu-devel] [RFC v6 02/14] softmmu: Add new TLB_EXCL flag
  2016-01-05 16:10   ` Alex Bennée
@ 2016-01-05 17:27     ` alvise rigo
  2016-01-05 18:39       ` Alex Bennée
  0 siblings, 1 reply; 60+ messages in thread
From: alvise rigo @ 2016-01-05 17:27 UTC (permalink / raw)
  To: Alex Bennée
  Cc: mttcg, Claudio Fontana, QEMU Developers, Paolo Bonzini,
	Jani Kokkonen, VirtualOpenSystems Technical Team,
	Richard Henderson

On Tue, Jan 5, 2016 at 5:10 PM, Alex Bennée <alex.bennee@linaro.org> wrote:
>
> Alvise Rigo <a.rigo@virtualopensystems.com> writes:
>
>> Add a new TLB flag to force all the accesses made to a page to follow
>> the slow-path.
>>
>> In the case we remove a TLB entry marked as EXCL, we unset the
>> corresponding exclusive bit in the bitmap.
>>
>> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
>> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
>> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
>> ---
>>  cputlb.c                |  38 +++++++++++++++-
>>  include/exec/cpu-all.h  |   8 ++++
>>  include/exec/cpu-defs.h |   1 +
>>  include/qom/cpu.h       |  14 ++++++
>>  softmmu_template.h      | 114 ++++++++++++++++++++++++++++++++++++++----------
>>  5 files changed, 152 insertions(+), 23 deletions(-)
>>
>> diff --git a/cputlb.c b/cputlb.c
>> index bf1d50a..7ee0c89 100644
>> --- a/cputlb.c
>> +++ b/cputlb.c
>> @@ -394,6 +394,16 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr,
>>      env->tlb_v_table[mmu_idx][vidx] = *te;
>>      env->iotlb_v[mmu_idx][vidx] = env->iotlb[mmu_idx][index];
>>
>> +    if (unlikely(!(te->addr_write & TLB_MMIO) && (te->addr_write &
>> TLB_EXCL))) {
>
> Why do we care about TLB_MMIO flags here? Does it actually happen? Would
> bad things happen if we enforced exclusivity for an MMIO write? Do the
> other flags matter?

In the previous version of the patch series it emerged that accesses
to MMIO regions have to be supported since, for instance, the GDB
stub relies on them.
The last two patches actually finalize the MMIO support.

>
> There should be a comment as to why MMIO is mentioned I think.

OK.

>
>> +        /* We are removing an exclusive entry, set the page to dirty. This
>> +         * is not be necessary if the vCPU has performed both SC and LL. */
>> +        hwaddr hw_addr = (env->iotlb[mmu_idx][index].addr & TARGET_PAGE_MASK) +
>> +                                          (te->addr_write & TARGET_PAGE_MASK);
>> +        if (!cpu->ll_sc_context) {
>> +            cpu_physical_memory_unset_excl(hw_addr, cpu->cpu_index);
>> +        }
>> +    }
>> +
>>      /* refill the tlb */
>>      env->iotlb[mmu_idx][index].addr = iotlb - vaddr;
>>      env->iotlb[mmu_idx][index].attrs = attrs;
>> @@ -419,7 +429,15 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr,
>>                                                     + xlat)) {
>>              te->addr_write = address | TLB_NOTDIRTY;
>>          } else {
>> -            te->addr_write = address;
>> +            if (!(address & TLB_MMIO) &&
>> +                cpu_physical_memory_atleast_one_excl(section->mr->ram_addr
>> +                                                           + xlat)) {
>> +                /* There is at least one vCPU that has flagged the address as
>> +                 * exclusive. */
>> +                te->addr_write = address | TLB_EXCL;
>> +            } else {
>> +                te->addr_write = address;
>> +            }
>>          }
>>      } else {
>>          te->addr_write = -1;
>> @@ -471,6 +489,24 @@ tb_page_addr_t get_page_addr_code(CPUArchState *env1, target_ulong addr)
>>      return qemu_ram_addr_from_host_nofail(p);
>>  }
>>
>> +/* For every vCPU compare the exclusive address and reset it in case of a
>> + * match. Since only one vCPU is running at once, no lock has to be held to
>> + * guard this operation. */
>> +static inline void lookup_and_reset_cpus_ll_addr(hwaddr addr, hwaddr size)
>> +{
>> +    CPUState *cpu;
>> +
>> +    CPU_FOREACH(cpu) {
>> +        if (cpu->excl_protected_range.begin != EXCLUSIVE_RESET_ADDR &&
>> +            ranges_overlap(cpu->excl_protected_range.begin,
>> +                           cpu->excl_protected_range.end -
>> +                           cpu->excl_protected_range.begin,
>> +                           addr, size)) {
>> +            cpu->excl_protected_range.begin = EXCLUSIVE_RESET_ADDR;
>> +        }
>> +    }
>> +}
>> +
>>  #define MMUSUFFIX _mmu
>>
>>  #define SHIFT 0
>> diff --git a/include/exec/cpu-all.h b/include/exec/cpu-all.h
>> index 83b1781..f8d8feb 100644
>> --- a/include/exec/cpu-all.h
>> +++ b/include/exec/cpu-all.h
>> @@ -277,6 +277,14 @@ CPUArchState *cpu_copy(CPUArchState *env);
>>  #define TLB_NOTDIRTY    (1 << 4)
>>  /* Set if TLB entry is an IO callback.  */
>>  #define TLB_MMIO        (1 << 5)
>> +/* Set if TLB entry references a page that requires exclusive access.  */
>> +#define TLB_EXCL        (1 << 6)
>> +
>> +/* Do not allow a TARGET_PAGE_MASK which covers one or more bits defined
>> + * above. */
>> +#if TLB_EXCL >= TARGET_PAGE_SIZE
>> +#error TARGET_PAGE_MASK covering the low bits of the TLB virtual address
>> +#endif
>>
>>  void dump_exec_info(FILE *f, fprintf_function cpu_fprintf);
>>  void dump_opcount_info(FILE *f, fprintf_function cpu_fprintf);
>> diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
>> index 5093be2..b34d7ae 100644
>> --- a/include/exec/cpu-defs.h
>> +++ b/include/exec/cpu-defs.h
>> @@ -27,6 +27,7 @@
>>  #include <inttypes.h>
>>  #include "qemu/osdep.h"
>>  #include "qemu/queue.h"
>> +#include "qemu/range.h"
>>  #include "tcg-target.h"
>>  #ifndef CONFIG_USER_ONLY
>>  #include "exec/hwaddr.h"
>> diff --git a/include/qom/cpu.h b/include/qom/cpu.h
>> index 51a1323..c6bb6b6 100644
>> --- a/include/qom/cpu.h
>> +++ b/include/qom/cpu.h
>> @@ -29,6 +29,7 @@
>>  #include "qemu/queue.h"
>>  #include "qemu/thread.h"
>>  #include "qemu/typedefs.h"
>> +#include "qemu/range.h"
>>
>>  typedef int (*WriteCoreDumpFunction)(const void *buf, size_t size,
>>                                       void *opaque);
>> @@ -210,6 +211,9 @@ struct kvm_run;
>>  #define TB_JMP_CACHE_BITS 12
>>  #define TB_JMP_CACHE_SIZE (1 << TB_JMP_CACHE_BITS)
>>
>> +/* Atomic insn translation TLB support. */
>> +#define EXCLUSIVE_RESET_ADDR ULLONG_MAX
>> +
>>  /**
>>   * CPUState:
>>   * @cpu_index: CPU index (informative).
>> @@ -329,6 +333,16 @@ struct CPUState {
>>       */
>>      bool throttle_thread_scheduled;
>>
>> +    /* Used by the atomic insn translation backend. */
>> +    int ll_sc_context;
>> +    /* vCPU current exclusive addresses range.
>> +     * The address is set to EXCLUSIVE_RESET_ADDR if the vCPU is not.
>> +     * in the middle of a LL/SC. */
>> +    struct Range excl_protected_range;
>> +    /* Used to carry the SC result but also to flag a normal (legacy)
>> +     * store access made by a stcond (see softmmu_template.h). */
>> +    int excl_succeeded;
>
> It might be clearer if excl_succeeded was defined as a bool?

Yes, that might be a good idea.

>
>>      /* Note that this is accessed at the start of every TB via a negative
>>         offset from AREG0.  Leave this field at the end so as to make the
>>         (absolute value) offset as small as possible.  This reduces code
>> diff --git a/softmmu_template.h b/softmmu_template.h
>> index 6803890..24d29b7 100644
>> --- a/softmmu_template.h
>> +++ b/softmmu_template.h
>> @@ -395,19 +395,54 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>>          tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
>>      }
>>
>> -    /* Handle an IO access.  */
>> +    /* Handle an IO access or exclusive access.  */
>>      if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
>> -        CPUIOTLBEntry *iotlbentry;
>> -        if ((addr & (DATA_SIZE - 1)) != 0) {
>> -            goto do_unaligned_access;
>> +        CPUIOTLBEntry *iotlbentry = &env->iotlb[mmu_idx][index];
>> +
>> +        if ((tlb_addr & ~TARGET_PAGE_MASK) == TLB_EXCL) {
>> +            CPUState *cpu = ENV_GET_CPU(env);
>> +            /* The slow-path has been forced since we are writing to
>> +             * exclusive-protected memory. */
>> +            hwaddr hw_addr = (iotlbentry->addr & TARGET_PAGE_MASK) + addr;
>> +
>> +            /* The function lookup_and_reset_cpus_ll_addr could have reset the
>> +             * exclusive address. Fail the SC in this case.
>> +             * N.B.: Here excl_succeeded == 0 means that helper_le_st_name has
>> +             * not been called by a softmmu_llsc_template.h. */
>
> Could this be better worded (along with bool-ising) as:
>
> "excl_succeeded is set by helper_le_st_name (softmmu_llsc_template)."
>
> But having said that grepping for helper_le_st_name I see that's defined
> in softmmu_template.h so now the comments has confused me.

I see now that the comment refers to softmmu_llsc_template that will
be created later on. Please consider this fixed.
In any case excl_succeeded, as the name suggests, is used by
helper_stcond_name to know if the exclusive access went well.
However, it is also used by softmmu_template to know whether we came
from softmmu_llsc_template or not. This behaviour is pointed out in a
comment in softmmu_llsc_template.

>
> It also might be worth mentioning the subtly that exclusive addresses
> are based on the real hwaddr (hence the iotlb lookup?).

OK.

>
>> +            if (cpu->excl_succeeded) {
>> +                if (cpu->excl_protected_range.begin != hw_addr) {
>> +                    /* The vCPU is SC-ing to an unprotected address. */
>> +                    cpu->excl_protected_range.begin = EXCLUSIVE_RESET_ADDR;
>> +                    cpu->excl_succeeded = 0;
>
> cpu->excl_succeeded = false;
>
>> +
>> +                    return;
>> +                }
>> +
>> +                cpu_physical_memory_unset_excl(hw_addr, cpu->cpu_index);
>> +            }
>> +
>> +            haddr = addr + env->tlb_table[mmu_idx][index].addend;
>> +        #if DATA_SIZE == 1
>> +            glue(glue(st, SUFFIX), _p)((uint8_t *)haddr, val);
>> +        #else
>> +            glue(glue(st, SUFFIX), _le_p)((uint8_t *)haddr, val);
>> +        #endif
>
> Why the special casing for byte access? Isn't this something the glue +
> SUFFIX magic is meant to sort out?

For byte accesses the byte ordering is irrelevant; in fact, there is
only one version of stb_p.

>
>> +
>> +            lookup_and_reset_cpus_ll_addr(hw_addr, DATA_SIZE);
>> +
>> +            return;
>> +        } else {
>> +            if ((addr & (DATA_SIZE - 1)) != 0) {
>> +                goto do_unaligned_access;
>> +            }
>> +            iotlbentry = &env->iotlb[mmu_idx][index];
>
> Are we re-loading the TLB entry here?

Indeed, that should not be there (anyhow it will go away when we
refactor this helper in patches 10,11,12).

>
>> +
>> +            /* ??? Note that the io helpers always read data in the target
>> +               byte ordering.  We should push the LE/BE request down into io.  */
>> +            val = TGT_LE(val);
>> +            glue(io_write, SUFFIX)(env, iotlbentry, val, addr,
>> retaddr);
>
> What happens if the software does and exclusive operation on a io
> address?

At this stage of the patch series such operations are not supported.
Should I add an hw_error in case software tries to do that? As
written above, patches 13 and 14 add the missing pieces to support
exclusive operations to MMIO regions.

>
>> +            return;
>>          }
>> -        iotlbentry = &env->iotlb[mmu_idx][index];
>> -
>> -        /* ??? Note that the io helpers always read data in the target
>> -           byte ordering.  We should push the LE/BE request down into io.  */
>> -        val = TGT_LE(val);
>> -        glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
>> -        return;
>>      }
>>
>>      /* Handle slow unaligned access (it spans two pages or IO).  */
>> @@ -475,19 +510,54 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>>          tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
>>      }
>>
>> -    /* Handle an IO access.  */
>> +    /* Handle an IO access or exclusive access.  */
>
> Hmm there looks like a massive amount of duplication (not your fault, it
> was like that when you got here ;-) but maybe this can be re-factored
> away somehow?

That's true. This is why patches 10,11,12 try to alleviate this
problem by making the code a little bit more compact and readable.

Regards,
alvise

>
>>      if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
>> -        CPUIOTLBEntry *iotlbentry;
>> -        if ((addr & (DATA_SIZE - 1)) != 0) {
>> -            goto do_unaligned_access;
>> +        CPUIOTLBEntry *iotlbentry = &env->iotlb[mmu_idx][index];
>> +
>> +        if ((tlb_addr & ~TARGET_PAGE_MASK) == TLB_EXCL) {
>> +            CPUState *cpu = ENV_GET_CPU(env);
>> +            /* The slow-path has been forced since we are writing to
>> +             * exclusive-protected memory. */
>> +            hwaddr hw_addr = (iotlbentry->addr & TARGET_PAGE_MASK) + addr;
>> +
>> +            /* The function lookup_and_reset_cpus_ll_addr could have reset the
>> +             * exclusive address. Fail the SC in this case.
>> +             * N.B.: Here excl_succeeded == 0 means that helper_le_st_name has
>> +             * not been called by a softmmu_llsc_template.h. */
>> +            if (cpu->excl_succeeded) {
>> +                if (cpu->excl_protected_range.begin != hw_addr) {
>> +                    /* The vCPU is SC-ing to an unprotected address. */
>> +                    cpu->excl_protected_range.begin = EXCLUSIVE_RESET_ADDR;
>> +                    cpu->excl_succeeded = 0;
>> +
>> +                    return;
>> +                }
>> +
>> +                cpu_physical_memory_unset_excl(hw_addr, cpu->cpu_index);
>> +            }
>> +
>> +            haddr = addr + env->tlb_table[mmu_idx][index].addend;
>> +        #if DATA_SIZE == 1
>> +            glue(glue(st, SUFFIX), _p)((uint8_t *)haddr, val);
>> +        #else
>> +            glue(glue(st, SUFFIX), _le_p)((uint8_t *)haddr, val);
>> +        #endif
>> +
>> +            lookup_and_reset_cpus_ll_addr(hw_addr, DATA_SIZE);
>> +
>> +            return;
>> +        } else {
>> +            if ((addr & (DATA_SIZE - 1)) != 0) {
>> +                goto do_unaligned_access;
>> +            }
>> +            iotlbentry = &env->iotlb[mmu_idx][index];
>> +
>> +            /* ??? Note that the io helpers always read data in the target
>> +               byte ordering.  We should push the LE/BE request down into io.  */
>> +            val = TGT_BE(val);
>> +            glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
>> +            return;
>>          }
>> -        iotlbentry = &env->iotlb[mmu_idx][index];
>> -
>> -        /* ??? Note that the io helpers always read data in the target
>> -           byte ordering.  We should push the LE/BE request down into io.  */
>> -        val = TGT_BE(val);
>> -        glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
>> -        return;
>>      }
>>
>>      /* Handle slow unaligned access (it spans two pages or IO).  */
>
>
> --
> Alex Bennée


* Re: [Qemu-devel] [RFC v6 02/14] softmmu: Add new TLB_EXCL flag
  2016-01-05 17:27     ` alvise rigo
@ 2016-01-05 18:39       ` Alex Bennée
  0 siblings, 0 replies; 60+ messages in thread
From: Alex Bennée @ 2016-01-05 18:39 UTC (permalink / raw)
  To: alvise rigo
  Cc: mttcg, Claudio Fontana, QEMU Developers, Paolo Bonzini,
	Jani Kokkonen, VirtualOpenSystems Technical Team,
	Richard Henderson


alvise rigo <a.rigo@virtualopensystems.com> writes:

> On Tue, Jan 5, 2016 at 5:10 PM, Alex Bennée <alex.bennee@linaro.org> wrote:
>>
>> Alvise Rigo <a.rigo@virtualopensystems.com> writes:
>>
>>> Add a new TLB flag to force all the accesses made to a page to follow
>>> the slow-path.
>>>
>>> In the case we remove a TLB entry marked as EXCL, we unset the
>>> corresponding exclusive bit in the bitmap.
>>>
>>> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
>>> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
>>> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
>>> ---
>>>  cputlb.c                |  38 +++++++++++++++-
>>>  include/exec/cpu-all.h  |   8 ++++
>>>  include/exec/cpu-defs.h |   1 +
>>>  include/qom/cpu.h       |  14 ++++++
>>>  softmmu_template.h      | 114 ++++++++++++++++++++++++++++++++++++++----------
>>>  5 files changed, 152 insertions(+), 23 deletions(-)
>>>
>>> diff --git a/cputlb.c b/cputlb.c
>>> index bf1d50a..7ee0c89 100644
>>> --- a/cputlb.c
>>> +++ b/cputlb.c
>>> @@ -394,6 +394,16 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr,
>>>      env->tlb_v_table[mmu_idx][vidx] = *te;
>>>      env->iotlb_v[mmu_idx][vidx] = env->iotlb[mmu_idx][index];
>>>
>>> +    if (unlikely(!(te->addr_write & TLB_MMIO) && (te->addr_write &
>>> TLB_EXCL))) {
>>
>> Why do we care about TLB_MMIO flags here? Does it actually happen? Would
>> bad things happen if we enforced exclusivity for an MMIO write? Do the
>> other flags matter?
>
> In the previous version of the patch series it came out that the
> accesses to MMIO regions have to be supported since, for instance, the
> GDB stub relies on them.
> The last two patches actually finalize the MMIO support.
>
>>
>> There should be a comment as to why MMIO is mentioned I think.
>
> OK.
>
>>
>>> +        /* We are removing an exclusive entry, set the page to dirty. This
>>> +         * is not be necessary if the vCPU has performed both SC and LL. */
>>> +        hwaddr hw_addr = (env->iotlb[mmu_idx][index].addr & TARGET_PAGE_MASK) +
>>> +                                          (te->addr_write & TARGET_PAGE_MASK);
>>> +        if (!cpu->ll_sc_context) {
>>> +            cpu_physical_memory_unset_excl(hw_addr, cpu->cpu_index);
>>> +        }
>>> +    }
>>> +
>>>      /* refill the tlb */
>>>      env->iotlb[mmu_idx][index].addr = iotlb - vaddr;
>>>      env->iotlb[mmu_idx][index].attrs = attrs;
>>> @@ -419,7 +429,15 @@ void tlb_set_page_with_attrs(CPUState *cpu, target_ulong vaddr,
>>>                                                     + xlat)) {
>>>              te->addr_write = address | TLB_NOTDIRTY;
>>>          } else {
>>> -            te->addr_write = address;
>>> +            if (!(address & TLB_MMIO) &&
>>> +                cpu_physical_memory_atleast_one_excl(section->mr->ram_addr
>>> +                                                           + xlat)) {
>>> +                /* There is at least one vCPU that has flagged the address as
>>> +                 * exclusive. */
>>> +                te->addr_write = address | TLB_EXCL;
>>> +            } else {
>>> +                te->addr_write = address;
>>> +            }
>>>          }
>>>      } else {
>>>          te->addr_write = -1;
>>> @@ -471,6 +489,24 @@ tb_page_addr_t get_page_addr_code(CPUArchState *env1, target_ulong addr)
>>>      return qemu_ram_addr_from_host_nofail(p);
>>>  }
>>>
>>> +/* For every vCPU compare the exclusive address and reset it in case of a
>>> + * match. Since only one vCPU is running at once, no lock has to be held to
>>> + * guard this operation. */
>>> +static inline void lookup_and_reset_cpus_ll_addr(hwaddr addr, hwaddr size)
>>> +{
>>> +    CPUState *cpu;
>>> +
>>> +    CPU_FOREACH(cpu) {
>>> +        if (cpu->excl_protected_range.begin != EXCLUSIVE_RESET_ADDR &&
>>> +            ranges_overlap(cpu->excl_protected_range.begin,
>>> +                           cpu->excl_protected_range.end -
>>> +                           cpu->excl_protected_range.begin,
>>> +                           addr, size)) {
>>> +            cpu->excl_protected_range.begin = EXCLUSIVE_RESET_ADDR;
>>> +        }
>>> +    }
>>> +}
>>> +
>>>  #define MMUSUFFIX _mmu
>>>
>>>  #define SHIFT 0
>>> diff --git a/include/exec/cpu-all.h b/include/exec/cpu-all.h
>>> index 83b1781..f8d8feb 100644
>>> --- a/include/exec/cpu-all.h
>>> +++ b/include/exec/cpu-all.h
>>> @@ -277,6 +277,14 @@ CPUArchState *cpu_copy(CPUArchState *env);
>>>  #define TLB_NOTDIRTY    (1 << 4)
>>>  /* Set if TLB entry is an IO callback.  */
>>>  #define TLB_MMIO        (1 << 5)
>>> +/* Set if TLB entry references a page that requires exclusive access.  */
>>> +#define TLB_EXCL        (1 << 6)
>>> +
>>> +/* Do not allow a TARGET_PAGE_MASK which covers one or more bits defined
>>> + * above. */
>>> +#if TLB_EXCL >= TARGET_PAGE_SIZE
>>> +#error TARGET_PAGE_MASK covering the low bits of the TLB virtual address
>>> +#endif
>>>
>>>  void dump_exec_info(FILE *f, fprintf_function cpu_fprintf);
>>>  void dump_opcount_info(FILE *f, fprintf_function cpu_fprintf);
>>> diff --git a/include/exec/cpu-defs.h b/include/exec/cpu-defs.h
>>> index 5093be2..b34d7ae 100644
>>> --- a/include/exec/cpu-defs.h
>>> +++ b/include/exec/cpu-defs.h
>>> @@ -27,6 +27,7 @@
>>>  #include <inttypes.h>
>>>  #include "qemu/osdep.h"
>>>  #include "qemu/queue.h"
>>> +#include "qemu/range.h"
>>>  #include "tcg-target.h"
>>>  #ifndef CONFIG_USER_ONLY
>>>  #include "exec/hwaddr.h"
>>> diff --git a/include/qom/cpu.h b/include/qom/cpu.h
>>> index 51a1323..c6bb6b6 100644
>>> --- a/include/qom/cpu.h
>>> +++ b/include/qom/cpu.h
>>> @@ -29,6 +29,7 @@
>>>  #include "qemu/queue.h"
>>>  #include "qemu/thread.h"
>>>  #include "qemu/typedefs.h"
>>> +#include "qemu/range.h"
>>>
>>>  typedef int (*WriteCoreDumpFunction)(const void *buf, size_t size,
>>>                                       void *opaque);
>>> @@ -210,6 +211,9 @@ struct kvm_run;
>>>  #define TB_JMP_CACHE_BITS 12
>>>  #define TB_JMP_CACHE_SIZE (1 << TB_JMP_CACHE_BITS)
>>>
>>> +/* Atomic insn translation TLB support. */
>>> +#define EXCLUSIVE_RESET_ADDR ULLONG_MAX
>>> +
>>>  /**
>>>   * CPUState:
>>>   * @cpu_index: CPU index (informative).
>>> @@ -329,6 +333,16 @@ struct CPUState {
>>>       */
>>>      bool throttle_thread_scheduled;
>>>
>>> +    /* Used by the atomic insn translation backend. */
>>> +    int ll_sc_context;
>>> +    /* vCPU current exclusive addresses range.
>>> +     * The address is set to EXCLUSIVE_RESET_ADDR if the vCPU is not.
>>> +     * in the middle of a LL/SC. */
>>> +    struct Range excl_protected_range;
>>> +    /* Used to carry the SC result but also to flag a normal (legacy)
>>> +     * store access made by a stcond (see softmmu_template.h). */
>>> +    int excl_succeeded;
>>
>> It might be clearer if excl_succeeded was defined as a bool?
>
> Yes, that might be a good idea.
>
>>
>>>      /* Note that this is accessed at the start of every TB via a negative
>>>         offset from AREG0.  Leave this field at the end so as to make the
>>>         (absolute value) offset as small as possible.  This reduces code
>>> diff --git a/softmmu_template.h b/softmmu_template.h
>>> index 6803890..24d29b7 100644
>>> --- a/softmmu_template.h
>>> +++ b/softmmu_template.h
>>> @@ -395,19 +395,54 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>>>          tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
>>>      }
>>>
>>> -    /* Handle an IO access.  */
>>> +    /* Handle an IO access or exclusive access.  */
>>>      if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
>>> -        CPUIOTLBEntry *iotlbentry;
>>> -        if ((addr & (DATA_SIZE - 1)) != 0) {
>>> -            goto do_unaligned_access;
>>> +        CPUIOTLBEntry *iotlbentry = &env->iotlb[mmu_idx][index];
>>> +
>>> +        if ((tlb_addr & ~TARGET_PAGE_MASK) == TLB_EXCL) {
>>> +            CPUState *cpu = ENV_GET_CPU(env);
>>> +            /* The slow-path has been forced since we are writing to
>>> +             * exclusive-protected memory. */
>>> +            hwaddr hw_addr = (iotlbentry->addr & TARGET_PAGE_MASK) + addr;
>>> +
>>> +            /* The function lookup_and_reset_cpus_ll_addr could have reset the
>>> +             * exclusive address. Fail the SC in this case.
>>> +             * N.B.: Here excl_succeeded == 0 means that helper_le_st_name has
>>> +             * not been called by a softmmu_llsc_template.h. */
>>
>> Could this be better worded (along with bool-ising) as:
>>
>> "excl_succeeded is set by helper_le_st_name (softmmu_llsc_template)."
>>
>> But having said that grepping for helper_le_st_name I see that's defined
>> in softmmu_template.h so now the comments has confused me.
>
> I see now that the comment refers to softmmu_llsc_template that will
> be created later on. Please consider this fixed.
> In any case excl_succeeded, as the name suggests, is used by
> helper_stcond_name to know if the exclusive access went well.
> However, it is also used by softmmu_template to know whether we came
> from softmmu_llsc_template or not. This behaviour is pointed out in a
> comment in softmmu_llsc_template.
>
>>
>> It also might be worth mentioning the subtly that exclusive addresses
>> are based on the real hwaddr (hence the iotlb lookup?).
>
> OK.
>
>>
>>> +            if (cpu->excl_succeeded) {
>>> +                if (cpu->excl_protected_range.begin != hw_addr) {
>>> +                    /* The vCPU is SC-ing to an unprotected address. */
>>> +                    cpu->excl_protected_range.begin = EXCLUSIVE_RESET_ADDR;
>>> +                    cpu->excl_succeeded = 0;
>>
>> cpu->excl_succeeded = false;
>>
>>> +
>>> +                    return;
>>> +                }
>>> +
>>> +                cpu_physical_memory_unset_excl(hw_addr, cpu->cpu_index);
>>> +            }
>>> +
>>> +            haddr = addr + env->tlb_table[mmu_idx][index].addend;
>>> +        #if DATA_SIZE == 1
>>> +            glue(glue(st, SUFFIX), _p)((uint8_t *)haddr, val);
>>> +        #else
>>> +            glue(glue(st, SUFFIX), _le_p)((uint8_t *)haddr, val);
>>> +        #endif
>>
>> Why the special casing for byte access? Isn't this something the glue +
>> SUFFIX magic is meant to sort out?
>
> For byte accesses the byte ordering is irrelevant; in fact, there is
> only one version of stb_p.

I'm just wondering why this little detail isn't hidden in the
st_SUFFIX_le_p helpers rather than having to be explicit here.

>
>>
>>> +
>>> +            lookup_and_reset_cpus_ll_addr(hw_addr, DATA_SIZE);
>>> +
>>> +            return;
>>> +        } else {
>>> +            if ((addr & (DATA_SIZE - 1)) != 0) {
>>> +                goto do_unaligned_access;
>>> +            }
>>> +            iotlbentry = &env->iotlb[mmu_idx][index];
>>
>> Are we re-loading the TLB entry here?
>
> Indeed, that should not be there (anyhow it will go away when we
> refactor this helper in patches 10,11,12).
>
>>
>>> +
>>> +            /* ??? Note that the io helpers always read data in the target
>>> +               byte ordering.  We should push the LE/BE request down into io.  */
>>> +            val = TGT_LE(val);
>>> +            glue(io_write, SUFFIX)(env, iotlbentry, val, addr,
>>> retaddr);
>>
>> What happens if the software does and exclusive operation on a io
>> address?
>
> At this stage of the patch series such operations are not supported.
> Should I add an hw_error in case software tries to do that?

If it is a known limitation then I think it might be worth it.

> As
> written above, patches 13 and 14 add the missing pieces to support
> exclusive operations to MMIO regions.
>
>>
>>> +            return;
>>>          }
>>> -        iotlbentry = &env->iotlb[mmu_idx][index];
>>> -
>>> -        /* ??? Note that the io helpers always read data in the target
>>> -           byte ordering.  We should push the LE/BE request down into io.  */
>>> -        val = TGT_LE(val);
>>> -        glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
>>> -        return;
>>>      }
>>>
>>>      /* Handle slow unaligned access (it spans two pages or IO).  */
>>> @@ -475,19 +510,54 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>>>          tlb_addr = env->tlb_table[mmu_idx][index].addr_write;
>>>      }
>>>
>>> -    /* Handle an IO access.  */
>>> +    /* Handle an IO access or exclusive access.  */
>>
>> Hmm there looks like a massive amount of duplication (not your fault, it
>> was like that when you got here ;-) but maybe this can be re-factored
>> away somehow?
>
> That's true. This is why patches 10,11,12 try to alleviate this
> problem by making the code a little bit more compact and readable.

Although I was actually comparing the two helpers on the final tree
state and there was loads of duplication still. If there is general
re-factoring outside of the LL/SC case it would be worth having them at
the start of the patch series so those improvements can get pulled into
the tree and the LL/SC code is simpler when applied.

>
> Regards,
> alvise
>
>>
>>>      if (unlikely(tlb_addr & ~TARGET_PAGE_MASK)) {
>>> -        CPUIOTLBEntry *iotlbentry;
>>> -        if ((addr & (DATA_SIZE - 1)) != 0) {
>>> -            goto do_unaligned_access;
>>> +        CPUIOTLBEntry *iotlbentry = &env->iotlb[mmu_idx][index];
>>> +
>>> +        if ((tlb_addr & ~TARGET_PAGE_MASK) == TLB_EXCL) {
>>> +            CPUState *cpu = ENV_GET_CPU(env);
>>> +            /* The slow-path has been forced since we are writing to
>>> +             * exclusive-protected memory. */
>>> +            hwaddr hw_addr = (iotlbentry->addr & TARGET_PAGE_MASK) + addr;
>>> +
>>> +            /* The function lookup_and_reset_cpus_ll_addr could have reset the
>>> +             * exclusive address. Fail the SC in this case.
> +             * N.B.: Here excl_succeeded == 0 means that helper_le_st_name has
> +             * not been called from softmmu_llsc_template.h. */
>>> +            if (cpu->excl_succeeded) {
>>> +                if (cpu->excl_protected_range.begin != hw_addr) {
>>> +                    /* The vCPU is SC-ing to an unprotected address. */
>>> +                    cpu->excl_protected_range.begin = EXCLUSIVE_RESET_ADDR;
>>> +                    cpu->excl_succeeded = 0;
>>> +
>>> +                    return;
>>> +                }
>>> +
>>> +                cpu_physical_memory_unset_excl(hw_addr, cpu->cpu_index);
>>> +            }
>>> +
>>> +            haddr = addr + env->tlb_table[mmu_idx][index].addend;
>>> +        #if DATA_SIZE == 1
>>> +            glue(glue(st, SUFFIX), _p)((uint8_t *)haddr, val);
>>> +        #else
>>> +            glue(glue(st, SUFFIX), _le_p)((uint8_t *)haddr, val);
>>> +        #endif
>>> +
>>> +            lookup_and_reset_cpus_ll_addr(hw_addr, DATA_SIZE);
>>> +
>>> +            return;
>>> +        } else {
>>> +            if ((addr & (DATA_SIZE - 1)) != 0) {
>>> +                goto do_unaligned_access;
>>> +            }
>>> +            iotlbentry = &env->iotlb[mmu_idx][index];
>>> +
>>> +            /* ??? Note that the io helpers always read data in the target
>>> +               byte ordering.  We should push the LE/BE request down into io.  */
>>> +            val = TGT_BE(val);
>>> +            glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
>>> +            return;
>>>          }
>>> -        iotlbentry = &env->iotlb[mmu_idx][index];
>>> -
>>> -        /* ??? Note that the io helpers always read data in the target
>>> -           byte ordering.  We should push the LE/BE request down into io.  */
>>> -        val = TGT_BE(val);
>>> -        glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
>>> -        return;
>>>      }
>>>
>>>      /* Handle slow unaligned access (it spans two pages or IO).  */
>>
>>
>> --
>> Alex Bennée


--
Alex Bennée

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [RFC v6 04/14] softmmu: Add helpers for a new slowpath
  2015-12-14  8:41 ` [Qemu-devel] [RFC v6 04/14] softmmu: Add helpers for a new slowpath Alvise Rigo
@ 2016-01-06 15:16   ` Alex Bennée
  0 siblings, 0 replies; 60+ messages in thread
From: Alex Bennée @ 2016-01-06 15:16 UTC (permalink / raw)
  To: Alvise Rigo
  Cc: mttcg, claudio.fontana, qemu-devel, pbonzini, jani.kokkonen, tech, rth


Alvise Rigo <a.rigo@virtualopensystems.com> writes:

> The new helpers rely on the legacy ones to perform the actual read/write.
>
> The LoadLink helper (helper_ldlink_name) prepares the way for the
> following SC operation. It sets the linked address and the size of the
> access.

nit: extra line or continue paragraph

> These helpers also update the TLB entry of the page involved in the
> LL/SC for those vCPUs that have the bit set (dirty), so that the
> following accesses made by all the vCPUs will follow the slow path.
>
> The StoreConditional helper (helper_stcond_name) returns 1 if the
> store has to fail due to a concurrent access to the same page by
> another vCPU. A 'concurrent access' can be a store made by *any* vCPU
> (although, some implementations allow stores made by the CPU that issued
> the LoadLink).
>
> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
> ---
>  cputlb.c                |   3 ++
>  softmmu_llsc_template.h | 134 ++++++++++++++++++++++++++++++++++++++++++++++++
>  softmmu_template.h      |  12 +++++
>  tcg/tcg.h               |  31 +++++++++++
>  4 files changed, 180 insertions(+)
>  create mode 100644 softmmu_llsc_template.h
>
> diff --git a/cputlb.c b/cputlb.c
> index 7ee0c89..70b6404 100644
> --- a/cputlb.c
> +++ b/cputlb.c
> @@ -509,6 +509,8 @@ static inline void lookup_and_reset_cpus_ll_addr(hwaddr addr, hwaddr size)
>
>  #define MMUSUFFIX _mmu
>
> +/* Generates LoadLink/StoreConditional helpers in softmmu_template.h */
> +#define GEN_EXCLUSIVE_HELPERS
>  #define SHIFT 0
>  #include "softmmu_template.h"
>
> @@ -521,6 +523,7 @@ static inline void lookup_and_reset_cpus_ll_addr(hwaddr addr, hwaddr size)
>  #define SHIFT 3
>  #include "softmmu_template.h"
>  #undef MMUSUFFIX
> +#undef GEN_EXCLUSIVE_HELPERS
>
>  #define MMUSUFFIX _cmmu
>  #undef GETPC_ADJ
> diff --git a/softmmu_llsc_template.h b/softmmu_llsc_template.h
> new file mode 100644
> index 0000000..586bb2e
> --- /dev/null
> +++ b/softmmu_llsc_template.h
> @@ -0,0 +1,134 @@
> +/*
> + *  Software MMU support (exclusive load/store operations)
> + *
> + * Generate helpers used by TCG for qemu_ldlink/stcond ops.
> + *
> + * Included from softmmu_template.h only.
> + *
> + * Copyright (c) 2015 Virtual Open Systems
> + *
> + * Authors:
> + *  Alvise Rigo <a.rigo@virtualopensystems.com>
> + *
> + * This library is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2 of the License, or (at your option) any later version.
> + *
> + * This library is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Public
> + * License along with this library; if not, see <http://www.gnu.org/licenses/>.
> + */
> +
> +/* Unlike softmmu_template.h, this template does not generate the le and be
> + * versions together, but only one of the two, depending on whether
> + * BIGENDIAN_EXCLUSIVE_HELPERS has been set.  The same nomenclature as
> + * softmmu_template.h is used for the exclusive helpers.  */
> +
> +#ifdef BIGENDIAN_EXCLUSIVE_HELPERS
> +
> +#define helper_ldlink_name  glue(glue(helper_be_ldlink, USUFFIX), MMUSUFFIX)
> +#define helper_stcond_name  glue(glue(helper_be_stcond, SUFFIX), MMUSUFFIX)
> +#define helper_ld glue(glue(helper_be_ld, USUFFIX), MMUSUFFIX)
> +#define helper_st glue(glue(helper_be_st, SUFFIX), MMUSUFFIX)
> +
> +#else /* LE helpers + 8bit helpers (generated only once for both LE and BE) */
> +
> +#if DATA_SIZE > 1
> +#define helper_ldlink_name  glue(glue(helper_le_ldlink, USUFFIX), MMUSUFFIX)
> +#define helper_stcond_name  glue(glue(helper_le_stcond, SUFFIX), MMUSUFFIX)
> +#define helper_ld glue(glue(helper_le_ld, USUFFIX), MMUSUFFIX)
> +#define helper_st glue(glue(helper_le_st, SUFFIX), MMUSUFFIX)
> +#else /* DATA_SIZE <= 1 */
> +#define helper_ldlink_name  glue(glue(helper_ret_ldlink, USUFFIX), MMUSUFFIX)
> +#define helper_stcond_name  glue(glue(helper_ret_stcond, SUFFIX), MMUSUFFIX)
> +#define helper_ld glue(glue(helper_ret_ld, USUFFIX), MMUSUFFIX)
> +#define helper_st glue(glue(helper_ret_st, SUFFIX), MMUSUFFIX)
> +#endif
> +
> +#endif
> +
> +WORD_TYPE helper_ldlink_name(CPUArchState *env, target_ulong addr,
> +                                TCGMemOpIdx oi, uintptr_t retaddr)
> +{
> +    WORD_TYPE ret;
> +    int index;
> +    CPUState *cpu, *this = ENV_GET_CPU(env);
> +    CPUClass *cc = CPU_GET_CLASS(this);
> +    hwaddr hw_addr;
> +    unsigned mmu_idx = get_mmuidx(oi);
> +
> +    /* Use the proper load helper from cpu_ldst.h */
> +    ret = helper_ld(env, addr, mmu_idx, retaddr);
> +
> +    index = (addr >> TARGET_PAGE_BITS) & (CPU_TLB_SIZE - 1);
> +
> +    /* hw_addr = hwaddr of the page (i.e. section->mr->ram_addr + xlat)
> +     * plus the offset (i.e. addr & ~TARGET_PAGE_MASK) */
> +    hw_addr = (env->iotlb[mmu_idx][index].addr & TARGET_PAGE_MASK) + addr;
> +
> +    cpu_physical_memory_set_excl(hw_addr, this->cpu_index);
> +    /* If all the vCPUs have the EXCL bit set for this page there is no need
> +     * to request any flush. */
> +    if (cpu_physical_memory_not_excl(hw_addr, smp_cpus)) {

Having done a double take reading this I think maybe a different helper
for the smp_cpus case would be useful, maybe cpu_any_memory_not_excl().
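A minimal standalone sketch of the suggested wrapper — the name `cpu_any_memory_not_excl()` and the per-CPU flag array are illustrative stand-ins, not QEMU's actual data structures; the real code would consult the exclusive bitmap attached to ram_list:

```c
#include <stdbool.h>

#define SMP_CPUS 4

/* Illustrative stand-in for the per-page, per-CPU EXCL bits that the
 * series keeps in the ram_list bitmap (false = bit not yet set). */
static bool page_excl_bit[SMP_CPUS];

/* Hypothetical cpu_any_memory_not_excl(): true if at least one vCPU
 * still lacks the EXCL bit for this page, i.e. at least one TLB flush
 * is still required before every vCPU takes the slow path. */
static bool cpu_any_memory_not_excl(void)
{
    for (int i = 0; i < SMP_CPUS; i++) {
        if (!page_excl_bit[i]) {
            return true;
        }
    }
    return false;
}
```

Such a wrapper would make the intent at the call site explicit instead of overloading the per-CPU predicate with an smp_cpus index.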

> +        CPU_FOREACH(cpu) {
> +            if (current_cpu != cpu) {

At this point this == current_cpu? As we are skipping it because we have
already done the work above, maybe we could be consistent and use this
(this_cpu?) or current_cpu.

> +                if (cpu_physical_memory_not_excl(hw_addr, cpu->cpu_index)) {
> +                    cpu_physical_memory_set_excl(hw_addr, cpu->cpu_index);
> +                    tlb_flush(cpu, 1);
> +                }
> +            }
> +        }
> +    }
> +
> +    cc->cpu_set_excl_protected_range(this, hw_addr, DATA_SIZE);
> +
> +    /* For this vCPU, just update the TLB entry, no need to flush. */
> +    env->tlb_table[mmu_idx][index].addr_write |= TLB_EXCL;
> +
> +    /* From now on we are in LL/SC context */
> +    this->ll_sc_context = 1;

apropos my previous emails, ll_sc_context should probably be a bool as well.

> +
> +    return ret;
> +}
> +
> +WORD_TYPE helper_stcond_name(CPUArchState *env, target_ulong addr,
> +                             DATA_TYPE val, TCGMemOpIdx oi,
> +                             uintptr_t retaddr)
> +{
> +    WORD_TYPE ret;
> +    unsigned mmu_idx = get_mmuidx(oi);
> +    CPUState *cpu = ENV_GET_CPU(env);
> +
> +    if (!cpu->ll_sc_context) {
> +        cpu->excl_succeeded = 0;
> +        ret = 1;
> +    } else {
> +        /* We preemptively set it to true to mark the following legacy
> +         * access as one made by the store conditional wrapper. If the store
> +         * conditional does not succeed, the value will be set to 0. */
> +        cpu->excl_succeeded = 1;
> +        helper_st(env, addr, val, mmu_idx, retaddr);
> +
> +        if (cpu->excl_succeeded) {
> +            cpu->excl_succeeded = 0;
> +            ret = 0;
> +        } else {
> +            ret = 1;
> +        }
> +
> +        /* Unset LL/SC context */
> +        cpu->ll_sc_context = 0;

excl_succeeded and ll_sc_context should use bool values.

> +    }
> +
> +    return ret;
> +}
> +
> +#undef helper_ldlink_name
> +#undef helper_stcond_name
> +#undef helper_ld
> +#undef helper_st
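Stripped of the softmmu plumbing (TLB lookup, per-page bitmap, cross-vCPU flushes), the success/failure protocol the two helpers above implement can be sketched standalone as follows — a sketch assuming a single protected address per CPU; all names are illustrative:

```c
#include <stdbool.h>
#include <stdint.h>

#define EXCLUSIVE_RESET_ADDR UINT64_MAX

static uint64_t excl_protected_addr = EXCLUSIVE_RESET_ADDR;
static bool ll_sc_context;

/* LoadLink: perform the load, record the protected address and enter
 * LL/SC context. */
static uint64_t load_link(const uint64_t *mem, uint64_t addr)
{
    excl_protected_addr = addr;
    ll_sc_context = true;
    return *mem;
}

/* StoreConditional: returns 0 on success, 1 on failure.  It fails when
 * no LoadLink preceded it, or when a conflicting store has reset the
 * protected address in the meantime. */
static int store_conditional(uint64_t *mem, uint64_t addr, uint64_t val)
{
    if (!ll_sc_context || excl_protected_addr != addr) {
        ll_sc_context = false;
        return 1;
    }
    *mem = val;
    excl_protected_addr = EXCLUSIVE_RESET_ADDR;
    ll_sc_context = false;
    return 0;
}
```

The real helpers delegate the actual memory access to the legacy ld/st helpers, which is where the conflicting-store detection (the excl_succeeded handshake) happens.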
> diff --git a/softmmu_template.h b/softmmu_template.h
> index 24d29b7..d3d5902 100644
> --- a/softmmu_template.h
> +++ b/softmmu_template.h
> @@ -620,6 +620,18 @@ void probe_write(CPUArchState *env, target_ulong addr, int mmu_idx,
>  #endif
>  #endif /* !defined(SOFTMMU_CODE_ACCESS) */
>
> +#ifdef GEN_EXCLUSIVE_HELPERS
> +
> +#if DATA_SIZE > 1 /* The 8-bit helpers are generated along with the LE helpers */
> +#define BIGENDIAN_EXCLUSIVE_HELPERS
> +#include "softmmu_llsc_template.h"
> +#undef BIGENDIAN_EXCLUSIVE_HELPERS
> +#endif
> +
> +#include "softmmu_llsc_template.h"
> +
> +#endif /* GEN_EXCLUSIVE_HELPERS */
> +
>  #undef READ_ACCESS_TYPE
>  #undef SHIFT
>  #undef DATA_TYPE
> diff --git a/tcg/tcg.h b/tcg/tcg.h
> index a696922..3e050a4 100644
> --- a/tcg/tcg.h
> +++ b/tcg/tcg.h
> @@ -968,6 +968,21 @@ tcg_target_ulong helper_be_ldul_mmu(CPUArchState *env, target_ulong addr,
>                                      TCGMemOpIdx oi, uintptr_t retaddr);
>  uint64_t helper_be_ldq_mmu(CPUArchState *env, target_ulong addr,
>                             TCGMemOpIdx oi, uintptr_t retaddr);
> +/* Exclusive variants */
> +tcg_target_ulong helper_ret_ldlinkub_mmu(CPUArchState *env, target_ulong addr,
> +                                            TCGMemOpIdx oi, uintptr_t retaddr);
> +tcg_target_ulong helper_le_ldlinkuw_mmu(CPUArchState *env, target_ulong addr,
> +                                            TCGMemOpIdx oi, uintptr_t retaddr);
> +tcg_target_ulong helper_le_ldlinkul_mmu(CPUArchState *env, target_ulong addr,
> +                                            TCGMemOpIdx oi, uintptr_t retaddr);
> +uint64_t helper_le_ldlinkq_mmu(CPUArchState *env, target_ulong addr,
> +                                            TCGMemOpIdx oi, uintptr_t retaddr);
> +tcg_target_ulong helper_be_ldlinkuw_mmu(CPUArchState *env, target_ulong addr,
> +                                            TCGMemOpIdx oi, uintptr_t retaddr);
> +tcg_target_ulong helper_be_ldlinkul_mmu(CPUArchState *env, target_ulong addr,
> +                                            TCGMemOpIdx oi, uintptr_t retaddr);
> +uint64_t helper_be_ldlinkq_mmu(CPUArchState *env, target_ulong addr,
> +                                            TCGMemOpIdx oi, uintptr_t retaddr);
>
>  /* Value sign-extended to tcg register size.  */
>  tcg_target_ulong helper_ret_ldsb_mmu(CPUArchState *env, target_ulong addr,
> @@ -1010,6 +1025,22 @@ uint32_t helper_be_ldl_cmmu(CPUArchState *env, target_ulong addr,
>                              TCGMemOpIdx oi, uintptr_t retaddr);
>  uint64_t helper_be_ldq_cmmu(CPUArchState *env, target_ulong addr,
>                              TCGMemOpIdx oi, uintptr_t retaddr);
> +/* Exclusive variants */
> +tcg_target_ulong helper_ret_stcondb_mmu(CPUArchState *env, target_ulong addr,
> +                            uint8_t val, TCGMemOpIdx oi, uintptr_t retaddr);
> +tcg_target_ulong helper_le_stcondw_mmu(CPUArchState *env, target_ulong addr,
> +                            uint16_t val, TCGMemOpIdx oi, uintptr_t retaddr);
> +tcg_target_ulong helper_le_stcondl_mmu(CPUArchState *env, target_ulong addr,
> +                            uint32_t val, TCGMemOpIdx oi, uintptr_t retaddr);
> +uint64_t helper_le_stcondq_mmu(CPUArchState *env, target_ulong addr,
> +                            uint64_t val, TCGMemOpIdx oi, uintptr_t retaddr);
> +tcg_target_ulong helper_be_stcondw_mmu(CPUArchState *env, target_ulong addr,
> +                            uint16_t val, TCGMemOpIdx oi, uintptr_t retaddr);
> +tcg_target_ulong helper_be_stcondl_mmu(CPUArchState *env, target_ulong addr,
> +                            uint32_t val, TCGMemOpIdx oi, uintptr_t retaddr);
> +uint64_t helper_be_stcondq_mmu(CPUArchState *env, target_ulong addr,
> +                            uint64_t val, TCGMemOpIdx oi, uintptr_t retaddr);
> +
>
>  /* Temporary aliases until backends are converted.  */
>  #ifdef TARGET_WORDS_BIGENDIAN


--
Alex Bennée


* Re: [Qemu-devel] [RFC v6 07/14] target-arm: translate: Use ld/st excl for atomic insns
  2015-12-14  8:41 ` [Qemu-devel] [RFC v6 07/14] target-arm: translate: Use ld/st excl for atomic insns Alvise Rigo
@ 2016-01-06 17:11   ` Alex Bennée
  0 siblings, 0 replies; 60+ messages in thread
From: Alex Bennée @ 2016-01-06 17:11 UTC (permalink / raw)
  To: Alvise Rigo
  Cc: mttcg, claudio.fontana, qemu-devel, pbonzini, jani.kokkonen, tech, rth


Alvise Rigo <a.rigo@virtualopensystems.com> writes:

> Use the new LL/SC runtime helpers to handle the ARM atomic
> instructions in softmmu_llsc_template.h.
>
> In general, the helper generator
> gen_helper_{ldlink,stcond}_aa32_i{8,16,32,64}() calls the function
> helper_{le,be}_{ldlink,stcond}{ub,uw,ulq}_mmu() implemented in
> softmmu_llsc_template.h.
>
> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
> ---
>  target-arm/translate.c | 101 +++++++++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 97 insertions(+), 4 deletions(-)
>
> diff --git a/target-arm/translate.c b/target-arm/translate.c
> index 5d22879..e88d8a3 100644
> --- a/target-arm/translate.c
> +++ b/target-arm/translate.c
> @@ -64,8 +64,10 @@ TCGv_ptr cpu_env;
>  static TCGv_i64 cpu_V0, cpu_V1, cpu_M0;
>  static TCGv_i32 cpu_R[16];
>  TCGv_i32 cpu_CF, cpu_NF, cpu_VF, cpu_ZF;
> +#ifndef CONFIG_TCG_USE_LDST_EXCL
>  TCGv_i64 cpu_exclusive_addr;
>  TCGv_i64 cpu_exclusive_val;
> +#endif
>  #ifdef CONFIG_USER_ONLY
>  TCGv_i64 cpu_exclusive_test;
>  TCGv_i32 cpu_exclusive_info;
> @@ -98,10 +100,12 @@ void arm_translate_init(void)
>      cpu_VF = tcg_global_mem_new_i32(TCG_AREG0, offsetof(CPUARMState, VF), "VF");
>      cpu_ZF = tcg_global_mem_new_i32(TCG_AREG0, offsetof(CPUARMState, ZF), "ZF");
>
> +#ifndef CONFIG_TCG_USE_LDST_EXCL
>      cpu_exclusive_addr = tcg_global_mem_new_i64(TCG_AREG0,
>          offsetof(CPUARMState, exclusive_addr), "exclusive_addr");
>      cpu_exclusive_val = tcg_global_mem_new_i64(TCG_AREG0,
>          offsetof(CPUARMState, exclusive_val), "exclusive_val");
> +#endif
>  #ifdef CONFIG_USER_ONLY
>      cpu_exclusive_test = tcg_global_mem_new_i64(TCG_AREG0,
>          offsetof(CPUARMState, exclusive_test), "exclusive_test");
> @@ -7414,15 +7418,59 @@ static void gen_logicq_cc(TCGv_i32 lo, TCGv_i32 hi)
>      tcg_gen_or_i32(cpu_ZF, lo, hi);
>  }
>
> -/* Load/Store exclusive instructions are implemented by remembering
> +/* If the softmmu is enabled, the translation of Load/Store exclusive
> + * instructions will rely on the gen_helper_{ldlink,stcond} helpers,
> + * offloading most of the work to the softmmu_llsc_template.h functions.
> +
> +   Otherwise, these instructions are implemented by remembering
>     the value/address loaded, and seeing if these are the same
>     when the store is performed. This should be sufficient to implement
>     the architecturally mandated semantics, and avoids having to monitor
>     regular stores.
>
> -   In system emulation mode only one CPU will be running at once, so
> -   this sequence is effectively atomic.  In user emulation mode we
> -   throw an exception and handle the atomic operation elsewhere.  */
> +   In user emulation mode we throw an exception and handle the atomic
> +   operation elsewhere.  */
> +#ifdef CONFIG_TCG_USE_LDST_EXCL
> +static void gen_load_exclusive(DisasContext *s, int rt, int rt2,
> +                               TCGv_i32 addr, int size)
> + {
> +    TCGv_i32 tmp = tcg_temp_new_i32();
> +    TCGv_i32 mem_idx = tcg_temp_new_i32();
> +
> +    tcg_gen_movi_i32(mem_idx, get_mem_index(s));
> +
> +    if (size != 3) {
> +        switch (size) {
> +        case 0:
> +            gen_helper_ldlink_aa32_i8(tmp, cpu_env, addr, mem_idx);
> +            break;
> +        case 1:
> +            gen_helper_ldlink_aa32_i16(tmp, cpu_env, addr, mem_idx);
> +            break;
> +        case 2:
> +            gen_helper_ldlink_aa32_i32(tmp, cpu_env, addr, mem_idx);
> +            break;
> +        default:
> +            abort();
> +        }
> +
> +        store_reg(s, rt, tmp);
> +    } else {
> +        TCGv_i64 tmp64 = tcg_temp_new_i64();
> +        TCGv_i32 tmph = tcg_temp_new_i32();
> +
> +        gen_helper_ldlink_aa32_i64(tmp64, cpu_env, addr, mem_idx);
> +        tcg_gen_extr_i64_i32(tmp, tmph, tmp64);
> +
> +        store_reg(s, rt, tmp);
> +        store_reg(s, rt2, tmph);
> +
> +        tcg_temp_free_i64(tmp64);
> +    }
> +
> +    tcg_temp_free_i32(mem_idx);
> +}
> +#else
>  static void gen_load_exclusive(DisasContext *s, int rt, int rt2,
>                                 TCGv_i32 addr, int size)
>  {
> @@ -7461,10 +7509,14 @@ static void gen_load_exclusive(DisasContext *s, int rt, int rt2,
>      store_reg(s, rt, tmp);
>      tcg_gen_extu_i32_i64(cpu_exclusive_addr, addr);
>  }
> +#endif
>
>  static void gen_clrex(DisasContext *s)
>  {
> +#ifdef CONFIG_TCG_USE_LDST_EXCL

I don't think it would be correct to ignore clrex in softmmu mode.
Assuming the code path had used it we may well be creating slow-path
transitions for no reason.

> +#else
>      tcg_gen_movi_i64(cpu_exclusive_addr, -1);
> +#endif
>  }
>
>  #ifdef CONFIG_USER_ONLY
> @@ -7476,6 +7528,47 @@ static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
>                       size | (rd << 4) | (rt << 8) | (rt2 << 12));
>      gen_exception_internal_insn(s, 4, EXCP_STREX);
>  }
> +#elif defined CONFIG_TCG_USE_LDST_EXCL
> +static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
> +                                TCGv_i32 addr, int size)
> +{
> +    TCGv_i32 tmp, mem_idx;
> +
> +    mem_idx = tcg_temp_new_i32();
> +
> +    tcg_gen_movi_i32(mem_idx, get_mem_index(s));
> +    tmp = load_reg(s, rt);
> +
> +    if (size != 3) {
> +        switch (size) {
> +        case 0:
> +            gen_helper_stcond_aa32_i8(cpu_R[rd], cpu_env, addr, tmp, mem_idx);
> +            break;
> +        case 1:
> +            gen_helper_stcond_aa32_i16(cpu_R[rd], cpu_env, addr, tmp, mem_idx);
> +            break;
> +        case 2:
> +            gen_helper_stcond_aa32_i32(cpu_R[rd], cpu_env, addr, tmp, mem_idx);
> +            break;
> +        default:
> +            abort();
> +        }
> +    } else {
> +        TCGv_i64 tmp64;
> +        TCGv_i32 tmp2;
> +
> +        tmp64 = tcg_temp_new_i64();
> +        tmp2 = load_reg(s, rt2);
> +        tcg_gen_concat_i32_i64(tmp64, tmp, tmp2);
> +        gen_helper_stcond_aa32_i64(cpu_R[rd], cpu_env, addr, tmp64, mem_idx);
> +
> +        tcg_temp_free_i32(tmp2);
> +        tcg_temp_free_i64(tmp64);
> +    }
> +
> +    tcg_temp_free_i32(tmp);
> +    tcg_temp_free_i32(mem_idx);
> +}
>  #else
>  static void gen_store_exclusive(DisasContext *s, int rd, int rt, int rt2,
>                                  TCGv_i32 addr, int size)


--
Alex Bennée


* Re: [Qemu-devel] [RFC v6 08/14] target-arm: Add atomic_clear helper for CLREX insn
  2015-12-14  8:41 ` [Qemu-devel] [RFC v6 08/14] target-arm: Add atomic_clear helper for CLREX insn Alvise Rigo
@ 2016-01-06 17:13   ` Alex Bennée
  2016-01-06 17:27     ` alvise rigo
  0 siblings, 1 reply; 60+ messages in thread
From: Alex Bennée @ 2016-01-06 17:13 UTC (permalink / raw)
  To: Alvise Rigo
  Cc: mttcg, claudio.fontana, qemu-devel, pbonzini, jani.kokkonen, tech, rth


Alvise Rigo <a.rigo@virtualopensystems.com> writes:

> Add a simple helper function to emulate the CLREX instruction.

And now I see ;-)

I suspect this should be merged with the other helpers as a generic helper.

>
> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
> ---
>  target-arm/helper.h    | 2 ++
>  target-arm/op_helper.c | 6 ++++++
>  target-arm/translate.c | 1 +
>  3 files changed, 9 insertions(+)
>
> diff --git a/target-arm/helper.h b/target-arm/helper.h
> index c2a85c7..37cec49 100644
> --- a/target-arm/helper.h
> +++ b/target-arm/helper.h
> @@ -532,6 +532,8 @@ DEF_HELPER_2(dc_zva, void, env, i64)
>  DEF_HELPER_FLAGS_2(neon_pmull_64_lo, TCG_CALL_NO_RWG_SE, i64, i64, i64)
>  DEF_HELPER_FLAGS_2(neon_pmull_64_hi, TCG_CALL_NO_RWG_SE, i64, i64, i64)
>
> +DEF_HELPER_1(atomic_clear, void, env)
> +
>  #ifdef TARGET_AARCH64
>  #include "helper-a64.h"
>  #endif
> diff --git a/target-arm/op_helper.c b/target-arm/op_helper.c
> index 6cd54c8..5a67557 100644
> --- a/target-arm/op_helper.c
> +++ b/target-arm/op_helper.c
> @@ -50,6 +50,12 @@ static int exception_target_el(CPUARMState *env)
>      return target_el;
>  }
>
> +void HELPER(atomic_clear)(CPUARMState *env)
> +{
> +    ENV_GET_CPU(env)->excl_protected_range.begin = -1;

Is there not a defined reset value EXCLUSIVE_RESET_ADDR we should use here?

> +    ENV_GET_CPU(env)->ll_sc_context = false;
> +}
> +
>  uint32_t HELPER(neon_tbl)(CPUARMState *env, uint32_t ireg, uint32_t def,
>                            uint32_t rn, uint32_t maxindex)
>  {
> diff --git a/target-arm/translate.c b/target-arm/translate.c
> index e88d8a3..e0362e0 100644
> --- a/target-arm/translate.c
> +++ b/target-arm/translate.c
> @@ -7514,6 +7514,7 @@ static void gen_load_exclusive(DisasContext *s, int rt, int rt2,
>  static void gen_clrex(DisasContext *s)
>  {
>  #ifdef CONFIG_TCG_USE_LDST_EXCL
> +    gen_helper_atomic_clear(cpu_env);
>  #else
>      tcg_gen_movi_i64(cpu_exclusive_addr, -1);
>  #endif


--
Alex Bennée


* Re: [Qemu-devel] [RFC v6 08/14] target-arm: Add atomic_clear helper for CLREX insn
  2016-01-06 17:13   ` Alex Bennée
@ 2016-01-06 17:27     ` alvise rigo
  0 siblings, 0 replies; 60+ messages in thread
From: alvise rigo @ 2016-01-06 17:27 UTC (permalink / raw)
  To: Alex Bennée
  Cc: mttcg, Claudio Fontana, QEMU Developers, Paolo Bonzini,
	Jani Kokkonen, VirtualOpenSystems Technical Team,
	Richard Henderson

On Wed, Jan 6, 2016 at 6:13 PM, Alex Bennée <alex.bennee@linaro.org> wrote:
>
> Alvise Rigo <a.rigo@virtualopensystems.com> writes:
>
>> Add a simple helper function to emulate the CLREX instruction.
>
> And now I see ;-)
>
> I suspect this should be merged with the other helpers as a generic helper.

Agreed.

>
>>
>> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
>> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
>> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
>> ---
>>  target-arm/helper.h    | 2 ++
>>  target-arm/op_helper.c | 6 ++++++
>>  target-arm/translate.c | 1 +
>>  3 files changed, 9 insertions(+)
>>
>> diff --git a/target-arm/helper.h b/target-arm/helper.h
>> index c2a85c7..37cec49 100644
>> --- a/target-arm/helper.h
>> +++ b/target-arm/helper.h
>> @@ -532,6 +532,8 @@ DEF_HELPER_2(dc_zva, void, env, i64)
>>  DEF_HELPER_FLAGS_2(neon_pmull_64_lo, TCG_CALL_NO_RWG_SE, i64, i64, i64)
>>  DEF_HELPER_FLAGS_2(neon_pmull_64_hi, TCG_CALL_NO_RWG_SE, i64, i64, i64)
>>
>> +DEF_HELPER_1(atomic_clear, void, env)
>> +
>>  #ifdef TARGET_AARCH64
>>  #include "helper-a64.h"
>>  #endif
>> diff --git a/target-arm/op_helper.c b/target-arm/op_helper.c
>> index 6cd54c8..5a67557 100644
>> --- a/target-arm/op_helper.c
>> +++ b/target-arm/op_helper.c
>> @@ -50,6 +50,12 @@ static int exception_target_el(CPUARMState *env)
>>      return target_el;
>>  }
>>
>> +void HELPER(atomic_clear)(CPUARMState *env)
>> +{
>> +    ENV_GET_CPU(env)->excl_protected_range.begin = -1;
>
> Is there not a defined reset value EXCLUSIVE_RESET_ADDR we should use here?

Yes, I will move the EXCLUSIVE_RESET_ADDR definition somewhere else in
order to include it in this file.

>
>> +    ENV_GET_CPU(env)->ll_sc_context = false;
>> +}
>> +
>>  uint32_t HELPER(neon_tbl)(CPUARMState *env, uint32_t ireg, uint32_t def,
>>                            uint32_t rn, uint32_t maxindex)
>>  {
>> diff --git a/target-arm/translate.c b/target-arm/translate.c
>> index e88d8a3..e0362e0 100644
>> --- a/target-arm/translate.c
>> +++ b/target-arm/translate.c
>> @@ -7514,6 +7514,7 @@ static void gen_load_exclusive(DisasContext *s, int rt, int rt2,
>>  static void gen_clrex(DisasContext *s)
>>  {
>>  #ifdef CONFIG_TCG_USE_LDST_EXCL
>> +    gen_helper_atomic_clear(cpu_env);
>>  #else
>>      tcg_gen_movi_i64(cpu_exclusive_addr, -1);
>>  #endif
>
>
> --
> Alex Bennée


* Re: [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation
  2015-12-14  8:41 [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation Alvise Rigo
                   ` (16 preceding siblings ...)
  2015-12-17 16:06 ` Alex Bennée
@ 2016-01-06 18:00 ` Andrew Baumann
  2016-01-07 10:21   ` alvise rigo
  17 siblings, 1 reply; 60+ messages in thread
From: Andrew Baumann @ 2016-01-06 18:00 UTC (permalink / raw)
  To: Alvise Rigo, qemu-devel, mttcg
  Cc: claudio.fontana, pbonzini, jani.kokkonen, tech, alex.bennee, rth

Hi,

> From: qemu-devel-bounces+andrew.baumann=microsoft.com@nongnu.org
> [mailto:qemu-devel-
> bounces+andrew.baumann=microsoft.com@nongnu.org] On Behalf Of
> Alvise Rigo
> Sent: Monday, 14 December 2015 00:41
> 
> This is the sixth iteration of the patch series which applies to the
> upstream branch of QEMU (v2.5.0-rc3).
> 
> Changes versus previous versions are at the bottom of this cover letter.
> 
> The code is also available at following repository:
> https://git.virtualopensystems.com/dev/qemu-mt.git
> branch:
> slowpath-for-atomic-v6-no-mttcg
> 
> This patch series provides an infrastructure for atomic instruction
> implementation in QEMU, thus offering a 'legacy' solution for
> translating guest atomic instructions. Moreover, it can be considered as
> a first step toward a multi-thread TCG.
> 
> The underlying idea is to provide new TCG helpers (sort of softmmu
> helpers) that guarantee atomicity to some memory accesses or in general
> a way to define memory transactions.
> 
> More specifically, the new softmmu helpers behave as LoadLink and
> StoreConditional instructions, and are called from TCG code by means of
> target specific helpers. This work includes the implementation for all
> the ARM atomic instructions, see target-arm/op_helper.c.

As a heads up, we just added support for alignment checks in LDREX:
https://github.com/qemu/qemu/commit/30901475b91ef1f46304404ab4bfe89097f61b96

Hopefully it is an easy change to ensure that the same check happens for the relevant loads when CONFIG_TCG_USE_LDST_EXCL is enabled?

Thanks,
Andrew
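For reference, the check itself is small: an access of DATA_SIZE bytes must sit on a DATA_SIZE boundary. This is only a sketch — the real helpers would raise the target's alignment fault through the CPU hooks rather than return a flag, and the function name is illustrative:

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch of the alignment test the exclusive helpers would perform,
 * mirroring the LDREX change referenced above: a data_size-byte access
 * must be data_size-aligned (data_size is a power of two). */
static bool addr_is_aligned(uint64_t addr, unsigned data_size)
{
    return (addr & (data_size - 1)) == 0;
}
```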


* Re: [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation
  2016-01-06 18:00 ` Andrew Baumann
@ 2016-01-07 10:21   ` alvise rigo
  2016-01-07 10:22     ` Peter Maydell
  0 siblings, 1 reply; 60+ messages in thread
From: alvise rigo @ 2016-01-07 10:21 UTC (permalink / raw)
  To: Andrew Baumann
  Cc: mttcg, claudio.fontana, qemu-devel, pbonzini, jani.kokkonen,
	tech, alex.bennee, rth

Hi,

On Wed, Jan 6, 2016 at 7:00 PM, Andrew Baumann
<Andrew.Baumann@microsoft.com> wrote:
>
> Hi,
>
> > From: qemu-devel-bounces+andrew.baumann=microsoft.com@nongnu.org
> > [mailto:qemu-devel-
> > bounces+andrew.baumann=microsoft.com@nongnu.org] On Behalf Of
> > Alvise Rigo
> > Sent: Monday, 14 December 2015 00:41
> >
> > This is the sixth iteration of the patch series which applies to the
> > upstream branch of QEMU (v2.5.0-rc3).
> >
> > Changes versus previous versions are at the bottom of this cover letter.
> >
> > The code is also available at following repository:
> > https://git.virtualopensystems.com/dev/qemu-mt.git
> > branch:
> > slowpath-for-atomic-v6-no-mttcg
> >
> > This patch series provides an infrastructure for atomic instruction
> > implementation in QEMU, thus offering a 'legacy' solution for
> > translating guest atomic instructions. Moreover, it can be considered as
> > a first step toward a multi-thread TCG.
> >
> > The underlying idea is to provide new TCG helpers (sort of softmmu
> > helpers) that guarantee atomicity to some memory accesses or in general
> > a way to define memory transactions.
> >
> > More specifically, the new softmmu helpers behave as LoadLink and
> > StoreConditional instructions, and are called from TCG code by means of
> > target specific helpers. This work includes the implementation for all
> > the ARM atomic instructions, see target-arm/op_helper.c.
>
> As a heads up, we just added support for alignment checks in LDREX:
> https://github.com/qemu/qemu/commit/30901475b91ef1f46304404ab4bfe89097f61b96

Thank you for the update.

>
> Hopefully it is an easy change to ensure that the same check happens for the relevant loads when CONFIG_TCG_USE_LDST_EXCL is enabled?

It should be, if we add an aligned variant for each of the exclusive helpers.
BTW, why don't we make the check also for the STREX instruction?

Regards,
alvise

>
> Thanks,
> Andrew


* Re: [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation
  2016-01-07 10:21   ` alvise rigo
@ 2016-01-07 10:22     ` Peter Maydell
  2016-01-07 10:49       ` alvise rigo
  0 siblings, 1 reply; 60+ messages in thread
From: Peter Maydell @ 2016-01-07 10:22 UTC (permalink / raw)
  To: alvise rigo
  Cc: mttcg, claudio.fontana, qemu-devel, Andrew Baumann, pbonzini,
	jani.kokkonen, tech, alex.bennee, rth

On 7 January 2016 at 10:21, alvise rigo <a.rigo@virtualopensystems.com> wrote:
> Hi,
>
> On Wed, Jan 6, 2016 at 7:00 PM, Andrew Baumann
> <Andrew.Baumann@microsoft.com> wrote:
>> As a heads up, we just added support for alignment checks in LDREX:
>> https://github.com/qemu/qemu/commit/30901475b91ef1f46304404ab4bfe89097f61b96

> It should be if we add an aligned variant for each of the exclusive helper.
> BTW, why don't we make the check also for the STREX instruction?

Andrew's patch only changed the bits Windows cares about, I think.
We should indeed extend this to cover STREX and the A64 instructions
as well.

thanks
-- PMM


* Re: [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation
  2016-01-07 10:22     ` Peter Maydell
@ 2016-01-07 10:49       ` alvise rigo
  2016-01-07 11:16         ` Peter Maydell
  0 siblings, 1 reply; 60+ messages in thread
From: alvise rigo @ 2016-01-07 10:49 UTC (permalink / raw)
  To: Peter Maydell
  Cc: mttcg, claudio.fontana, qemu-devel, Andrew Baumann, pbonzini,
	jani.kokkonen, tech, alex.bennee, rth

On Thu, Jan 7, 2016 at 11:22 AM, Peter Maydell <peter.maydell@linaro.org> wrote:
> On 7 January 2016 at 10:21, alvise rigo <a.rigo@virtualopensystems.com> wrote:
>> Hi,
>>
>> On Wed, Jan 6, 2016 at 7:00 PM, Andrew Baumann
>> <Andrew.Baumann@microsoft.com> wrote:
>>> As a heads up, we just added support for alignment checks in LDREX:
>>> https://github.com/qemu/qemu/commit/30901475b91ef1f46304404ab4bfe89097f61b96
>
>> It should be if we add an aligned variant for each of the exclusive helper.
>> BTW, why don't we make the check also for the STREX instruction?
>
> Andrew's patch only changed the bits Windows cares about, I think.
> We should indeed extend this to cover also STREX and the A64 instructions
> as well, I think.

The alignment check is easily doable in general. The only tricky part
I found is A64's STXP instruction, which requires quadword alignment
for the 64-bit paired access.
In that case, the translation of the instruction will rely on an
aarch64-only helper. The alternative solution would be to extend
softmmu_template.h to generate 128-bit accesses, but I don't believe
that is the right way to go.

Regards,
alvise

>
> thanks
> -- PMM


* Re: [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation
  2016-01-07 10:49       ` alvise rigo
@ 2016-01-07 11:16         ` Peter Maydell
  0 siblings, 0 replies; 60+ messages in thread
From: Peter Maydell @ 2016-01-07 11:16 UTC (permalink / raw)
  To: alvise rigo
  Cc: mttcg, claudio.fontana, qemu-devel, Andrew Baumann, pbonzini,
	jani.kokkonen, tech, alex.bennee, rth

On 7 January 2016 at 10:49, alvise rigo <a.rigo@virtualopensystems.com> wrote:
> On Thu, Jan 7, 2016 at 11:22 AM, Peter Maydell <peter.maydell@linaro.org> wrote:
>> On 7 January 2016 at 10:21, alvise rigo <a.rigo@virtualopensystems.com> wrote:
>>> Hi,
>>>
>>> On Wed, Jan 6, 2016 at 7:00 PM, Andrew Baumann
>>> <Andrew.Baumann@microsoft.com> wrote:
>>>> As a heads up, we just added support for alignment checks in LDREX:
>>>> https://github.com/qemu/qemu/commit/30901475b91ef1f46304404ab4bfe89097f61b96
>>
>>> It should be if we add an aligned variant for each of the exclusive helper.
>>> BTW, why don't we make the check also for the STREX instruction?
>>
>> Andrew's patch only changed the bits Windows cares about, I think.
>> We should indeed extend this to cover also STREX and the A64 instructions
>> as well, I think.
>
> The alignment check is easily doable in general. The only tricky part
> I found is the A64's STXP instruction that requires quadword alignment
> for the 64bit paired access.
> In that case, the translation of the instruction will rely on a
> aarch64-only helper. The alternative solution would be to extend
> softmmu_template.h to generate 128bit accesses, but I don't believe
> this is the right way to go.

Yes, a 128-bit alignment check is not currently easy. We should do
the others first and then think about the right approach for the
128-bit stuff. (I forget what rth's view on that was.)

thanks
-- PMM


* Re: [Qemu-devel] [RFC v6 10/14] softmmu: Simplify helper_*_st_name, wrap unaligned code
  2015-12-14  8:41 ` [Qemu-devel] [RFC v6 10/14] softmmu: Simplify helper_*_st_name, wrap unaligned code Alvise Rigo
@ 2016-01-07 14:46   ` Alex Bennée
  2016-01-07 15:09     ` alvise rigo
  2016-01-08 11:19   ` Alex Bennée
  1 sibling, 1 reply; 60+ messages in thread
From: Alex Bennée @ 2016-01-07 14:46 UTC (permalink / raw)
  To: Alvise Rigo
  Cc: mttcg, claudio.fontana, qemu-devel, pbonzini, jani.kokkonen, tech, rth


Alvise Rigo <a.rigo@virtualopensystems.com> writes:

> Attempting to simplify the helper_*_st_name, wrap the
> do_unaligned_access code into an inline function.
> Remove also the goto statement.

As I said in the other thread I think these sort of clean-ups can come
before the ll/sc implementations and potentially get merged ahead of the
rest of it.

>
> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
> ---
>  softmmu_template.h | 96 ++++++++++++++++++++++++++++++++++--------------------
>  1 file changed, 60 insertions(+), 36 deletions(-)
>
> diff --git a/softmmu_template.h b/softmmu_template.h
> index d3d5902..92f92b1 100644
> --- a/softmmu_template.h
> +++ b/softmmu_template.h
> @@ -370,6 +370,32 @@ static inline void glue(io_write, SUFFIX)(CPUArchState *env,
>                                   iotlbentry->attrs);
>  }
>
> +static inline void glue(helper_le_st_name, _do_unl_access)(CPUArchState *env,
> +                                                           DATA_TYPE val,
> +                                                           target_ulong addr,
> +                                                           TCGMemOpIdx oi,
> +                                                           unsigned mmu_idx,
> +                                                           uintptr_t retaddr)
> +{
> +    int i;
> +
> +    if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
> +        cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
> +                             mmu_idx, retaddr);
> +    }
> +    /* XXX: not efficient, but simple */
> +    /* Note: relies on the fact that tlb_fill() does not remove the
> +     * previous page from the TLB cache.  */
> +    for (i = DATA_SIZE - 1; i >= 0; i--) {
> +        /* Little-endian extract.  */
> +        uint8_t val8 = val >> (i * 8);
> +        /* Note the adjustment at the beginning of the function.
> +           Undo that for the recursion.  */
> +        glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
> +                                        oi, retaddr + GETPC_ADJ);
> +    }
> +}

There is still duplication of 99% of the code here, which is silly given
that the compiler inlines the code anyway. If we gave the helper a more
generic name and passed the endianness in via args, I would hope the
compiler would do the sensible thing and constant-fold the code. Something
like:

static inline void glue(helper_generic_st_name, _do_unl_access)
                        (CPUArchState *env,
                        bool little_endian,
                        DATA_TYPE val,
                        target_ulong addr,
                        TCGMemOpIdx oi,
                        unsigned mmu_idx,
                        uintptr_t retaddr)
{
    int i;

    if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
        cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
                             mmu_idx, retaddr);
    }
    /* Note: relies on the fact that tlb_fill() does not remove the
     * previous page from the TLB cache.  */
    for (i = DATA_SIZE - 1; i >= 0; i--) {
        uint8_t val8;
        if (little_endian) {
                /* Little-endian extract.  */
                val8 = val >> (i * 8);
        } else {
                /* Big-endian extract.  */
                val8 = val >> (((DATA_SIZE - 1) * 8) - (i * 8));
        }
        /* Note the adjustment at the beginning of the function.
           Undo that for the recursion.  */
        glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
                                        oi, retaddr + GETPC_ADJ);
    }
}


> +
>  void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>                         TCGMemOpIdx oi, uintptr_t retaddr)
>  {
> @@ -433,7 +459,8 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>              return;
>          } else {
>              if ((addr & (DATA_SIZE - 1)) != 0) {
> -                goto do_unaligned_access;
> +                glue(helper_le_st_name, _do_unl_access)(env, val, addr, mmu_idx,
> +                                                        oi, retaddr);
>              }
>              iotlbentry = &env->iotlb[mmu_idx][index];
>
> @@ -449,23 +476,8 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>      if (DATA_SIZE > 1
>          && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
>                       >= TARGET_PAGE_SIZE)) {
> -        int i;
> -    do_unaligned_access:
> -        if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
> -            cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
> -                                 mmu_idx, retaddr);
> -        }
> -        /* XXX: not efficient, but simple */
> -        /* Note: relies on the fact that tlb_fill() does not remove the
> -         * previous page from the TLB cache.  */
> -        for (i = DATA_SIZE - 1; i >= 0; i--) {
> -            /* Little-endian extract.  */
> -            uint8_t val8 = val >> (i * 8);
> -            /* Note the adjustment at the beginning of the function.
> -               Undo that for the recursion.  */
> -            glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
> -                                            oi, retaddr + GETPC_ADJ);
> -        }
> +        glue(helper_le_st_name, _do_unl_access)(env, val, addr, oi, mmu_idx,
> +                                                retaddr);
>          return;
>      }
>
> @@ -485,6 +497,32 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>  }
>
>  #if DATA_SIZE > 1
> +static inline void glue(helper_be_st_name, _do_unl_access)(CPUArchState *env,
> +                                                           DATA_TYPE val,
> +                                                           target_ulong addr,
> +                                                           TCGMemOpIdx oi,
> +                                                           unsigned mmu_idx,
> +                                                           uintptr_t retaddr)
> +{
> +    int i;
> +
> +    if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
> +        cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
> +                             mmu_idx, retaddr);
> +    }
> +    /* XXX: not efficient, but simple */
> +    /* Note: relies on the fact that tlb_fill() does not remove the
> +     * previous page from the TLB cache.  */
> +    for (i = DATA_SIZE - 1; i >= 0; i--) {
> +        /* Big-endian extract.  */
> +        uint8_t val8 = val >> (((DATA_SIZE - 1) * 8) - (i * 8));
> +        /* Note the adjustment at the beginning of the function.
> +           Undo that for the recursion.  */
> +        glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
> +                                        oi, retaddr + GETPC_ADJ);
> +    }
> +}

Not that it matters if you combine the two as suggested, because anything
not called shouldn't generate any code.

>  void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>                         TCGMemOpIdx oi, uintptr_t retaddr)
>  {
> @@ -548,7 +586,8 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>              return;
>          } else {
>              if ((addr & (DATA_SIZE - 1)) != 0) {
> -                goto do_unaligned_access;
> +                glue(helper_be_st_name, _do_unl_access)(env, val, addr, mmu_idx,
> +                                                        oi, retaddr);
>              }
>              iotlbentry = &env->iotlb[mmu_idx][index];
>
> @@ -564,23 +603,8 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>      if (DATA_SIZE > 1
>          && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
>                       >= TARGET_PAGE_SIZE)) {
> -        int i;
> -    do_unaligned_access:
> -        if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
> -            cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
> -                                 mmu_idx, retaddr);
> -        }
> -        /* XXX: not efficient, but simple */
> -        /* Note: relies on the fact that tlb_fill() does not remove the
> -         * previous page from the TLB cache.  */
> -        for (i = DATA_SIZE - 1; i >= 0; i--) {
> -            /* Big-endian extract.  */
> -            uint8_t val8 = val >> (((DATA_SIZE - 1) * 8) - (i * 8));
> -            /* Note the adjustment at the beginning of the function.
> -               Undo that for the recursion.  */
> -            glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
> -                                            oi, retaddr + GETPC_ADJ);
> -        }
> +        glue(helper_be_st_name, _do_unl_access)(env, val, addr, oi, mmu_idx,
> +                                                retaddr);
>          return;
>      }


--
Alex Bennée


* Re: [Qemu-devel] [RFC v6 10/14] softmmu: Simplify helper_*_st_name, wrap unaligned code
  2016-01-07 14:46   ` Alex Bennée
@ 2016-01-07 15:09     ` alvise rigo
  2016-01-07 16:35       ` Alex Bennée
  0 siblings, 1 reply; 60+ messages in thread
From: alvise rigo @ 2016-01-07 15:09 UTC (permalink / raw)
  To: Alex Bennée
  Cc: mttcg, Claudio Fontana, QEMU Developers, Paolo Bonzini,
	Jani Kokkonen, VirtualOpenSystems Technical Team,
	Richard Henderson

On Thu, Jan 7, 2016 at 3:46 PM, Alex Bennée <alex.bennee@linaro.org> wrote:
>
> Alvise Rigo <a.rigo@virtualopensystems.com> writes:
>
>> Attempting to simplify the helper_*_st_name, wrap the
>> do_unaligned_access code into an inline function.
>> Remove also the goto statement.
>
> As I said in the other thread I think these sort of clean-ups can come
> before the ll/sc implementations and potentially get merged ahead of the
> rest of it.
>
>>
>> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
>> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
>> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
>> ---
>>  softmmu_template.h | 96 ++++++++++++++++++++++++++++++++++--------------------
>>  1 file changed, 60 insertions(+), 36 deletions(-)
>>
>> diff --git a/softmmu_template.h b/softmmu_template.h
>> index d3d5902..92f92b1 100644
>> --- a/softmmu_template.h
>> +++ b/softmmu_template.h
>> @@ -370,6 +370,32 @@ static inline void glue(io_write, SUFFIX)(CPUArchState *env,
>>                                   iotlbentry->attrs);
>>  }
>>
>> +static inline void glue(helper_le_st_name, _do_unl_access)(CPUArchState *env,
>> +                                                           DATA_TYPE val,
>> +                                                           target_ulong addr,
>> +                                                           TCGMemOpIdx oi,
>> +                                                           unsigned mmu_idx,
>> +                                                           uintptr_t retaddr)
>> +{
>> +    int i;
>> +
>> +    if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
>> +        cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
>> +                             mmu_idx, retaddr);
>> +    }
>> +    /* XXX: not efficient, but simple */
>> +    /* Note: relies on the fact that tlb_fill() does not remove the
>> +     * previous page from the TLB cache.  */
>> +    for (i = DATA_SIZE - 1; i >= 0; i--) {
>> +        /* Little-endian extract.  */
>> +        uint8_t val8 = val >> (i * 8);
>> +        /* Note the adjustment at the beginning of the function.
>> +           Undo that for the recursion.  */
>> +        glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
>> +                                        oi, retaddr + GETPC_ADJ);
>> +    }
>> +}
>
> There is still duplication of 99% of the code here which is silly given

Then why should we keep this template-like design in the first place?
I tried to keep the code duplication for performance reasons
(otherwise, how could we justify the two almost identical versions of
the helper?), while making the code more compact and readable.

> the compiler inlines the code anyway. If we gave the helper a more
> generic name and passed the endianess in via args I would hope the
> compiler did the sensible thing and constant fold the code. Something
> like:
>
> static inline void glue(helper_generic_st_name, _do_unl_access)
>                         (CPUArchState *env,
>                         bool little_endian,
>                         DATA_TYPE val,
>                         target_ulong addr,
>                         TCGMemOpIdx oi,
>                         unsigned mmu_idx,
>                         uintptr_t retaddr)
> {
>     int i;
>
>     if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
>         cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
>                              mmu_idx, retaddr);
>     }
>     /* Note: relies on the fact that tlb_fill() does not remove the
>      * previous page from the TLB cache.  */
>     for (i = DATA_SIZE - 1; i >= 0; i--) {
>         if (little_endian) {

little_endian will have the same value >99% of the time; does it make
sense to have a branch here?

Thank you,
alvise

>                 /* Little-endian extract.  */
>                 uint8_t val8 = val >> (i * 8);
>         } else {
>                 /* Big-endian extract.  */
>                 uint8_t val8 = val >> (((DATA_SIZE - 1) * 8) - (i * 8));
>         }
>         /* Note the adjustment at the beginning of the function.
>            Undo that for the recursion.  */
>         glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
>                                         oi, retaddr + GETPC_ADJ);
>     }
> }
>
>
>> +
>>  void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>>                         TCGMemOpIdx oi, uintptr_t retaddr)
>>  {
>> @@ -433,7 +459,8 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>>              return;
>>          } else {
>>              if ((addr & (DATA_SIZE - 1)) != 0) {
>> -                goto do_unaligned_access;
>> +                glue(helper_le_st_name, _do_unl_access)(env, val, addr, mmu_idx,
>> +                                                        oi, retaddr);
>>              }
>>              iotlbentry = &env->iotlb[mmu_idx][index];
>>
>> @@ -449,23 +476,8 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>>      if (DATA_SIZE > 1
>>          && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
>>                       >= TARGET_PAGE_SIZE)) {
>> -        int i;
>> -    do_unaligned_access:
>> -        if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
>> -            cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
>> -                                 mmu_idx, retaddr);
>> -        }
>> -        /* XXX: not efficient, but simple */
>> -        /* Note: relies on the fact that tlb_fill() does not remove the
>> -         * previous page from the TLB cache.  */
>> -        for (i = DATA_SIZE - 1; i >= 0; i--) {
>> -            /* Little-endian extract.  */
>> -            uint8_t val8 = val >> (i * 8);
>> -            /* Note the adjustment at the beginning of the function.
>> -               Undo that for the recursion.  */
>> -            glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
>> -                                            oi, retaddr + GETPC_ADJ);
>> -        }
>> +        glue(helper_le_st_name, _do_unl_access)(env, val, addr, oi, mmu_idx,
>> +                                                retaddr);
>>          return;
>>      }
>>
>> @@ -485,6 +497,32 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>>  }
>>
>>  #if DATA_SIZE > 1
>> +static inline void glue(helper_be_st_name, _do_unl_access)(CPUArchState *env,
>> +                                                           DATA_TYPE val,
>> +                                                           target_ulong addr,
>> +                                                           TCGMemOpIdx oi,
>> +                                                           unsigned mmu_idx,
>> +                                                           uintptr_t retaddr)
>> +{
>> +    int i;
>> +
>> +    if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
>> +        cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
>> +                             mmu_idx, retaddr);
>> +    }
>> +    /* XXX: not efficient, but simple */
>> +    /* Note: relies on the fact that tlb_fill() does not remove the
>> +     * previous page from the TLB cache.  */
>> +    for (i = DATA_SIZE - 1; i >= 0; i--) {
>> +        /* Big-endian extract.  */
>> +        uint8_t val8 = val >> (((DATA_SIZE - 1) * 8) - (i * 8));
>> +        /* Note the adjustment at the beginning of the function.
>> +           Undo that for the recursion.  */
>> +        glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
>> +                                        oi, retaddr + GETPC_ADJ);
>> +    }
>> +}
>
> Not that it matters if you combine to two as suggested because anything
> not called shouldn't generate the code.
>
>>  void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>>                         TCGMemOpIdx oi, uintptr_t retaddr)
>>  {
>> @@ -548,7 +586,8 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>>              return;
>>          } else {
>>              if ((addr & (DATA_SIZE - 1)) != 0) {
>> -                goto do_unaligned_access;
>> +                glue(helper_be_st_name, _do_unl_access)(env, val, addr, mmu_idx,
>> +                                                        oi, retaddr);
>>              }
>>              iotlbentry = &env->iotlb[mmu_idx][index];
>>
>> @@ -564,23 +603,8 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>>      if (DATA_SIZE > 1
>>          && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
>>                       >= TARGET_PAGE_SIZE)) {
>> -        int i;
>> -    do_unaligned_access:
>> -        if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
>> -            cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
>> -                                 mmu_idx, retaddr);
>> -        }
>> -        /* XXX: not efficient, but simple */
>> -        /* Note: relies on the fact that tlb_fill() does not remove the
>> -         * previous page from the TLB cache.  */
>> -        for (i = DATA_SIZE - 1; i >= 0; i--) {
>> -            /* Big-endian extract.  */
>> -            uint8_t val8 = val >> (((DATA_SIZE - 1) * 8) - (i * 8));
>> -            /* Note the adjustment at the beginning of the function.
>> -               Undo that for the recursion.  */
>> -            glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
>> -                                            oi, retaddr + GETPC_ADJ);
>> -        }
>> +        glue(helper_be_st_name, _do_unl_access)(env, val, addr, oi, mmu_idx,
>> +                                                retaddr);
>>          return;
>>      }
>
>
> --
> Alex Bennée


* Re: [Qemu-devel] [RFC v6 10/14] softmmu: Simplify helper_*_st_name, wrap unaligned code
  2016-01-07 15:09     ` alvise rigo
@ 2016-01-07 16:35       ` Alex Bennée
  2016-01-07 16:54         ` alvise rigo
  0 siblings, 1 reply; 60+ messages in thread
From: Alex Bennée @ 2016-01-07 16:35 UTC (permalink / raw)
  To: alvise rigo
  Cc: mttcg, Claudio Fontana, QEMU Developers, Paolo Bonzini,
	Jani Kokkonen, VirtualOpenSystems Technical Team,
	Richard Henderson


alvise rigo <a.rigo@virtualopensystems.com> writes:

> On Thu, Jan 7, 2016 at 3:46 PM, Alex Bennée <alex.bennee@linaro.org> wrote:
>>
>> Alvise Rigo <a.rigo@virtualopensystems.com> writes:
>>
>>> Attempting to simplify the helper_*_st_name, wrap the
>>> do_unaligned_access code into an inline function.
>>> Remove also the goto statement.
>>
>> As I said in the other thread I think these sort of clean-ups can come
>> before the ll/sc implementations and potentially get merged ahead of the
>> rest of it.
>>
>>>
>>> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
>>> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
>>> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
>>> ---
>>>  softmmu_template.h | 96 ++++++++++++++++++++++++++++++++++--------------------
>>>  1 file changed, 60 insertions(+), 36 deletions(-)
>>>
>>> diff --git a/softmmu_template.h b/softmmu_template.h
>>> index d3d5902..92f92b1 100644
>>> --- a/softmmu_template.h
>>> +++ b/softmmu_template.h
>>> @@ -370,6 +370,32 @@ static inline void glue(io_write, SUFFIX)(CPUArchState *env,
>>>                                   iotlbentry->attrs);
>>>  }
>>>
>>> +static inline void glue(helper_le_st_name, _do_unl_access)(CPUArchState *env,
>>> +                                                           DATA_TYPE val,
>>> +                                                           target_ulong addr,
>>> +                                                           TCGMemOpIdx oi,
>>> +                                                           unsigned mmu_idx,
>>> +                                                           uintptr_t retaddr)
>>> +{
>>> +    int i;
>>> +
>>> +    if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
>>> +        cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
>>> +                             mmu_idx, retaddr);
>>> +    }
>>> +    /* XXX: not efficient, but simple */
>>> +    /* Note: relies on the fact that tlb_fill() does not remove the
>>> +     * previous page from the TLB cache.  */
>>> +    for (i = DATA_SIZE - 1; i >= 0; i--) {
>>> +        /* Little-endian extract.  */
>>> +        uint8_t val8 = val >> (i * 8);
>>> +        /* Note the adjustment at the beginning of the function.
>>> +           Undo that for the recursion.  */
>>> +        glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
>>> +                                        oi, retaddr + GETPC_ADJ);
>>> +    }
>>> +}
>>
>> There is still duplication of 99% of the code here which is silly given
>
> Then why should we keep this template-like design in the first place?
> I tried to keep the code duplication for performance reasons
> (otherwise how can we justify the two almost identical versions of the
> helper?), while making the code more compact and readable.

We shouldn't really - code duplication is bad for all the well-known
reasons. The main reason we need explicit helpers for the be/le case is
that they are called directly from the TCG code, which encodes the
endianness decision in the call it makes. However, that doesn't stop us
making generic inline helpers (helpers for the helpers ;-) which the
compiler can sort out.

>
>> the compiler inlines the code anyway. If we gave the helper a more
>> generic name and passed the endianess in via args I would hope the
>> compiler did the sensible thing and constant fold the code. Something
>> like:
>>
>> static inline void glue(helper_generic_st_name, _do_unl_access)
>>                         (CPUArchState *env,
>>                         bool little_endian,
>>                         DATA_TYPE val,
>>                         target_ulong addr,
>>                         TCGMemOpIdx oi,
>>                         unsigned mmu_idx,
>>                         uintptr_t retaddr)
>> {
>>     int i;
>>
>>     if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
>>         cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
>>                              mmu_idx, retaddr);
>>     }
>>     /* Note: relies on the fact that tlb_fill() does not remove the
>>      * previous page from the TLB cache.  */
>>     for (i = DATA_SIZE - 1; i >= 0; i--) {
>>         if (little_endian) {
>
> little_endian will have >99% of the time the same value, does it make
> sense to have a branch here?

The compiler should detect that little_endian is constant when it
inlines the code and not bother generating a test/branch for
something that will never happen.

Even if it did, though, I doubt a local branch would stall the processor
that much; have you counted how many instructions we execute once we are
on the slow path?

>
> Thank you,
> alvise
>
>>                 /* Little-endian extract.  */
>>                 uint8_t val8 = val >> (i * 8);
>>         } else {
>>                 /* Big-endian extract.  */
>>                 uint8_t val8 = val >> (((DATA_SIZE - 1) * 8) - (i * 8));
>>         }
>>         /* Note the adjustment at the beginning of the function.
>>            Undo that for the recursion.  */
>>         glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
>>                                         oi, retaddr + GETPC_ADJ);
>>     }
>> }
>>
>>
>>> +
>>>  void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>>>                         TCGMemOpIdx oi, uintptr_t retaddr)
>>>  {
>>> @@ -433,7 +459,8 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>>>              return;
>>>          } else {
>>>              if ((addr & (DATA_SIZE - 1)) != 0) {
>>> -                goto do_unaligned_access;
>>> +                glue(helper_le_st_name, _do_unl_access)(env, val, addr, mmu_idx,
>>> +                                                        oi, retaddr);
>>>              }
>>>              iotlbentry = &env->iotlb[mmu_idx][index];
>>>
>>> @@ -449,23 +476,8 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>>>      if (DATA_SIZE > 1
>>>          && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
>>>                       >= TARGET_PAGE_SIZE)) {
>>> -        int i;
>>> -    do_unaligned_access:
>>> -        if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
>>> -            cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
>>> -                                 mmu_idx, retaddr);
>>> -        }
>>> -        /* XXX: not efficient, but simple */
>>> -        /* Note: relies on the fact that tlb_fill() does not remove the
>>> -         * previous page from the TLB cache.  */
>>> -        for (i = DATA_SIZE - 1; i >= 0; i--) {
>>> -            /* Little-endian extract.  */
>>> -            uint8_t val8 = val >> (i * 8);
>>> -            /* Note the adjustment at the beginning of the function.
>>> -               Undo that for the recursion.  */
>>> -            glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
>>> -                                            oi, retaddr + GETPC_ADJ);
>>> -        }
>>> +        glue(helper_le_st_name, _do_unl_access)(env, val, addr, oi, mmu_idx,
>>> +                                                retaddr);
>>>          return;
>>>      }
>>>
>>> @@ -485,6 +497,32 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>>>  }
>>>
>>>  #if DATA_SIZE > 1
>>> +static inline void glue(helper_be_st_name, _do_unl_access)(CPUArchState *env,
>>> +                                                           DATA_TYPE val,
>>> +                                                           target_ulong addr,
>>> +                                                           TCGMemOpIdx oi,
>>> +                                                           unsigned mmu_idx,
>>> +                                                           uintptr_t retaddr)
>>> +{
>>> +    int i;
>>> +
>>> +    if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
>>> +        cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
>>> +                             mmu_idx, retaddr);
>>> +    }
>>> +    /* XXX: not efficient, but simple */
>>> +    /* Note: relies on the fact that tlb_fill() does not remove the
>>> +     * previous page from the TLB cache.  */
>>> +    for (i = DATA_SIZE - 1; i >= 0; i--) {
>>> +        /* Big-endian extract.  */
>>> +        uint8_t val8 = val >> (((DATA_SIZE - 1) * 8) - (i * 8));
>>> +        /* Note the adjustment at the beginning of the function.
>>> +           Undo that for the recursion.  */
>>> +        glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
>>> +                                        oi, retaddr + GETPC_ADJ);
>>> +    }
>>> +}
>>
>> Not that it matters if you combine the two as suggested because anything
>> not called shouldn't generate the code.
>>
>>>  void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>>>                         TCGMemOpIdx oi, uintptr_t retaddr)
>>>  {
>>> @@ -548,7 +586,8 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>>>              return;
>>>          } else {
>>>              if ((addr & (DATA_SIZE - 1)) != 0) {
>>> -                goto do_unaligned_access;
>>> +                glue(helper_be_st_name, _do_unl_access)(env, val, addr, mmu_idx,
>>> +                                                        oi, retaddr);
>>>              }
>>>              iotlbentry = &env->iotlb[mmu_idx][index];
>>>
>>> @@ -564,23 +603,8 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>>>      if (DATA_SIZE > 1
>>>          && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
>>>                       >= TARGET_PAGE_SIZE)) {
>>> -        int i;
>>> -    do_unaligned_access:
>>> -        if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
>>> -            cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
>>> -                                 mmu_idx, retaddr);
>>> -        }
>>> -        /* XXX: not efficient, but simple */
>>> -        /* Note: relies on the fact that tlb_fill() does not remove the
>>> -         * previous page from the TLB cache.  */
>>> -        for (i = DATA_SIZE - 1; i >= 0; i--) {
>>> -            /* Big-endian extract.  */
>>> -            uint8_t val8 = val >> (((DATA_SIZE - 1) * 8) - (i * 8));
>>> -            /* Note the adjustment at the beginning of the function.
>>> -               Undo that for the recursion.  */
>>> -            glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
>>> -                                            oi, retaddr + GETPC_ADJ);
>>> -        }
>>> +        glue(helper_be_st_name, _do_unl_access)(env, val, addr, oi, mmu_idx,
>>> +                                                retaddr);
>>>          return;
>>>      }
>>
>>
>> --
>> Alex Bennée


--
Alex Bennée

^ permalink raw reply	[flat|nested] 60+ messages in thread

* Re: [Qemu-devel] [RFC v6 10/14] softmmu: Simplify helper_*_st_name, wrap unaligned code
  2016-01-07 16:35       ` Alex Bennée
@ 2016-01-07 16:54         ` alvise rigo
  2016-01-07 17:36           ` Alex Bennée
  0 siblings, 1 reply; 60+ messages in thread
From: alvise rigo @ 2016-01-07 16:54 UTC (permalink / raw)
  To: Alex Bennée
  Cc: mttcg, Claudio Fontana, QEMU Developers, Paolo Bonzini,
	Jani Kokkonen, VirtualOpenSystems Technical Team,
	Richard Henderson

On Thu, Jan 7, 2016 at 5:35 PM, Alex Bennée <alex.bennee@linaro.org> wrote:
>
> alvise rigo <a.rigo@virtualopensystems.com> writes:
>
>> On Thu, Jan 7, 2016 at 3:46 PM, Alex Bennée <alex.bennee@linaro.org> wrote:
>>>
>>> Alvise Rigo <a.rigo@virtualopensystems.com> writes:
>>>
>>>> To simplify the helper_*_st_name functions, wrap the
>>>> do_unaligned_access code into an inline function.
>>>> Also remove the goto statement.
>>>
>>> As I said in the other thread I think these sort of clean-ups can come
>>> before the ll/sc implementations and potentially get merged ahead of the
>>> rest of it.
>>>
>>>>
>>>> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
>>>> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
>>>> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
>>>> ---
>>>>  softmmu_template.h | 96 ++++++++++++++++++++++++++++++++++--------------------
>>>>  1 file changed, 60 insertions(+), 36 deletions(-)
>>>>
>>>> diff --git a/softmmu_template.h b/softmmu_template.h
>>>> index d3d5902..92f92b1 100644
>>>> --- a/softmmu_template.h
>>>> +++ b/softmmu_template.h
>>>> @@ -370,6 +370,32 @@ static inline void glue(io_write, SUFFIX)(CPUArchState *env,
>>>>                                   iotlbentry->attrs);
>>>>  }
>>>>
>>>> +static inline void glue(helper_le_st_name, _do_unl_access)(CPUArchState *env,
>>>> +                                                           DATA_TYPE val,
>>>> +                                                           target_ulong addr,
>>>> +                                                           TCGMemOpIdx oi,
>>>> +                                                           unsigned mmu_idx,
>>>> +                                                           uintptr_t retaddr)
>>>> +{
>>>> +    int i;
>>>> +
>>>> +    if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
>>>> +        cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
>>>> +                             mmu_idx, retaddr);
>>>> +    }
>>>> +    /* XXX: not efficient, but simple */
>>>> +    /* Note: relies on the fact that tlb_fill() does not remove the
>>>> +     * previous page from the TLB cache.  */
>>>> +    for (i = DATA_SIZE - 1; i >= 0; i--) {
>>>> +        /* Little-endian extract.  */
>>>> +        uint8_t val8 = val >> (i * 8);
>>>> +        /* Note the adjustment at the beginning of the function.
>>>> +           Undo that for the recursion.  */
>>>> +        glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
>>>> +                                        oi, retaddr + GETPC_ADJ);
>>>> +    }
>>>> +}
>>>
>>> There is still duplication of 99% of the code here which is silly given
>>
>> Then why should we keep this template-like design in the first place?
>> I tried to keep the code duplication for performance reasons
>> (otherwise how can we justify the two almost identical versions of the
>> helper?), while making the code more compact and readable.
>
> We shouldn't really - code duplication is bad for all the well known
> reasons. The main reason we need explicit helpers for the be/le case are
> because they are called directly from the TCG code which encodes the
> endianess decision in the call it makes. However that doesn't stop us
> making generic inline helpers (helpers for the helpers ;-) which the
> compiler can sort out.

I thought you wanted to make all the le/be differences conditional, not
just those in the helpers for the helpers...
So, if we are allowed to introduce this small overhead, all the
helper_{le,be}_st_name_do_{unl,mmio,ram}_access can be squashed to
helper_generic_st_do_{unl,mmio,ram}_access. I think this is what you
proposed in the POC, right?
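As a standalone illustration of the squash under discussion (hypothetical names, not the actual softmmu_template.h macros), both endian-specific entry points can forward to one generic inline with a constant endianness flag; when the flag is a compile-time constant the compiler can fold the branch away:

```c
#include <assert.h>
#include <stdint.h>

/* One generic byte-store loop covering both byte orders. The
 * `little_endian` flag is constant at every call site, so after
 * inlining the compiler keeps only one of the two extract forms. */
static inline void generic_st_bytes(uint8_t *mem, uint64_t val,
                                    int size, int little_endian)
{
    for (int i = size - 1; i >= 0; i--) {
        uint8_t val8;
        if (little_endian) {
            val8 = val >> (i * 8);                      /* LE extract */
        } else {
            val8 = val >> (((size - 1) * 8) - (i * 8)); /* BE extract */
        }
        mem[i] = val8;
    }
}

/* The endian-specific wrappers reduce to one-line calls: */
static void st_le(uint8_t *mem, uint64_t val, int size)
{
    generic_st_bytes(mem, val, size, 1);
}

static void st_be(uint8_t *mem, uint64_t val, int size)
{
    generic_st_bytes(mem, val, size, 0);
}
```

This sketch only models the byte-extract loop, not the TLB lookup or MMIO paths, but it shows why the duplication can be removed without a runtime cost.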

>
>>
>>> the compiler inlines the code anyway. If we gave the helper a more
>>> generic name and passed the endianess in via args I would hope the
>>> compiler did the sensible thing and constant fold the code. Something
>>> like:
>>>
>>> static inline void glue(helper_generic_st_name, _do_unl_access)
>>>                         (CPUArchState *env,
>>>                         bool little_endian,
>>>                         DATA_TYPE val,
>>>                         target_ulong addr,
>>>                         TCGMemOpIdx oi,
>>>                         unsigned mmu_idx,
>>>                         uintptr_t retaddr)
>>> {
>>>     int i;
>>>
>>>     if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
>>>         cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
>>>                              mmu_idx, retaddr);
>>>     }
>>>     /* Note: relies on the fact that tlb_fill() does not remove the
>>>      * previous page from the TLB cache.  */
>>>     for (i = DATA_SIZE - 1; i >= 0; i--) {
>>>         if (little_endian) {
>>
>> little_endian will have the same value >99% of the time; does it make
>> sense to have a branch here?
>
> The compiler should detect that little_endian is constant when it
> inlines the code and not bother generating a test/branch case for
> something that will never happen.
>
> Even if it did though I doubt a local branch would stall the processor
> that much; have you counted how many instructions we execute once we are
> on the slow path?

Too many :)

Regards,
alvise

>
>>
>> Thank you,
>> alvise
>>
>>>                 /* Little-endian extract.  */
>>>                 uint8_t val8 = val >> (i * 8);
>>>         } else {
>>>                 /* Big-endian extract.  */
>>>                 uint8_t val8 = val >> (((DATA_SIZE - 1) * 8) - (i * 8));
>>>         }
>>>         /* Note the adjustment at the beginning of the function.
>>>            Undo that for the recursion.  */
>>>         glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
>>>                                         oi, retaddr + GETPC_ADJ);
>>>     }
>>> }
>>>
>>>
>>>> +
>>>>  void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>>>>                         TCGMemOpIdx oi, uintptr_t retaddr)
>>>>  {
>>>> @@ -433,7 +459,8 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>>>>              return;
>>>>          } else {
>>>>              if ((addr & (DATA_SIZE - 1)) != 0) {
>>>> -                goto do_unaligned_access;
>>>> +                glue(helper_le_st_name, _do_unl_access)(env, val, addr, mmu_idx,
>>>> +                                                        oi, retaddr);
>>>>              }
>>>>              iotlbentry = &env->iotlb[mmu_idx][index];
>>>>
>>>> @@ -449,23 +476,8 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>>>>      if (DATA_SIZE > 1
>>>>          && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
>>>>                       >= TARGET_PAGE_SIZE)) {
>>>> -        int i;
>>>> -    do_unaligned_access:
>>>> -        if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
>>>> -            cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
>>>> -                                 mmu_idx, retaddr);
>>>> -        }
>>>> -        /* XXX: not efficient, but simple */
>>>> -        /* Note: relies on the fact that tlb_fill() does not remove the
>>>> -         * previous page from the TLB cache.  */
>>>> -        for (i = DATA_SIZE - 1; i >= 0; i--) {
>>>> -            /* Little-endian extract.  */
>>>> -            uint8_t val8 = val >> (i * 8);
>>>> -            /* Note the adjustment at the beginning of the function.
>>>> -               Undo that for the recursion.  */
>>>> -            glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
>>>> -                                            oi, retaddr + GETPC_ADJ);
>>>> -        }
>>>> +        glue(helper_le_st_name, _do_unl_access)(env, val, addr, oi, mmu_idx,
>>>> +                                                retaddr);
>>>>          return;
>>>>      }
>>>>
>>>> @@ -485,6 +497,32 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>>>>  }
>>>>
>>>>  #if DATA_SIZE > 1
>>>> +static inline void glue(helper_be_st_name, _do_unl_access)(CPUArchState *env,
>>>> +                                                           DATA_TYPE val,
>>>> +                                                           target_ulong addr,
>>>> +                                                           TCGMemOpIdx oi,
>>>> +                                                           unsigned mmu_idx,
>>>> +                                                           uintptr_t retaddr)
>>>> +{
>>>> +    int i;
>>>> +
>>>> +    if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
>>>> +        cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
>>>> +                             mmu_idx, retaddr);
>>>> +    }
>>>> +    /* XXX: not efficient, but simple */
>>>> +    /* Note: relies on the fact that tlb_fill() does not remove the
>>>> +     * previous page from the TLB cache.  */
>>>> +    for (i = DATA_SIZE - 1; i >= 0; i--) {
>>>> +        /* Big-endian extract.  */
>>>> +        uint8_t val8 = val >> (((DATA_SIZE - 1) * 8) - (i * 8));
>>>> +        /* Note the adjustment at the beginning of the function.
>>>> +           Undo that for the recursion.  */
>>>> +        glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
>>>> +                                        oi, retaddr + GETPC_ADJ);
>>>> +    }
>>>> +}
>>>
>>> Not that it matters if you combine the two as suggested because anything
>>> not called shouldn't generate the code.
>>>
>>>>  void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>>>>                         TCGMemOpIdx oi, uintptr_t retaddr)
>>>>  {
>>>> @@ -548,7 +586,8 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>>>>              return;
>>>>          } else {
>>>>              if ((addr & (DATA_SIZE - 1)) != 0) {
>>>> -                goto do_unaligned_access;
>>>> +                glue(helper_be_st_name, _do_unl_access)(env, val, addr, mmu_idx,
>>>> +                                                        oi, retaddr);
>>>>              }
>>>>              iotlbentry = &env->iotlb[mmu_idx][index];
>>>>
>>>> @@ -564,23 +603,8 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>>>>      if (DATA_SIZE > 1
>>>>          && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
>>>>                       >= TARGET_PAGE_SIZE)) {
>>>> -        int i;
>>>> -    do_unaligned_access:
>>>> -        if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
>>>> -            cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
>>>> -                                 mmu_idx, retaddr);
>>>> -        }
>>>> -        /* XXX: not efficient, but simple */
>>>> -        /* Note: relies on the fact that tlb_fill() does not remove the
>>>> -         * previous page from the TLB cache.  */
>>>> -        for (i = DATA_SIZE - 1; i >= 0; i--) {
>>>> -            /* Big-endian extract.  */
>>>> -            uint8_t val8 = val >> (((DATA_SIZE - 1) * 8) - (i * 8));
>>>> -            /* Note the adjustment at the beginning of the function.
>>>> -               Undo that for the recursion.  */
>>>> -            glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
>>>> -                                            oi, retaddr + GETPC_ADJ);
>>>> -        }
>>>> +        glue(helper_be_st_name, _do_unl_access)(env, val, addr, oi, mmu_idx,
>>>> +                                                retaddr);
>>>>          return;
>>>>      }
>>>
>>>
>>> --
>>> Alex Bennée
>
>
> --
> Alex Bennée


* Re: [Qemu-devel] [RFC v6 10/14] softmmu: Simplify helper_*_st_name, wrap unaligned code
  2016-01-07 16:54         ` alvise rigo
@ 2016-01-07 17:36           ` Alex Bennée
  0 siblings, 0 replies; 60+ messages in thread
From: Alex Bennée @ 2016-01-07 17:36 UTC (permalink / raw)
  To: alvise rigo
  Cc: mttcg, Claudio Fontana, QEMU Developers, Paolo Bonzini,
	Jani Kokkonen, VirtualOpenSystems Technical Team,
	Richard Henderson


alvise rigo <a.rigo@virtualopensystems.com> writes:

> On Thu, Jan 7, 2016 at 5:35 PM, Alex Bennée <alex.bennee@linaro.org> wrote:
>>
>> alvise rigo <a.rigo@virtualopensystems.com> writes:
>>
>>> On Thu, Jan 7, 2016 at 3:46 PM, Alex Bennée <alex.bennee@linaro.org> wrote:
>>>>
>>>> Alvise Rigo <a.rigo@virtualopensystems.com> writes:
>>>>
>>>>> To simplify the helper_*_st_name functions, wrap the
>>>>> do_unaligned_access code into an inline function.
>>>>> Also remove the goto statement.
>>>>
>>>> As I said in the other thread I think these sort of clean-ups can come
>>>> before the ll/sc implementations and potentially get merged ahead of the
>>>> rest of it.
>>>>
>>>>>
>>>>> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
>>>>> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
>>>>> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
>>>>> ---
>>>>>  softmmu_template.h | 96 ++++++++++++++++++++++++++++++++++--------------------
>>>>>  1 file changed, 60 insertions(+), 36 deletions(-)
>>>>>
>>>>> diff --git a/softmmu_template.h b/softmmu_template.h
>>>>> index d3d5902..92f92b1 100644
>>>>> --- a/softmmu_template.h
>>>>> +++ b/softmmu_template.h
>>>>> @@ -370,6 +370,32 @@ static inline void glue(io_write, SUFFIX)(CPUArchState *env,
>>>>>                                   iotlbentry->attrs);
>>>>>  }
>>>>>
>>>>> +static inline void glue(helper_le_st_name, _do_unl_access)(CPUArchState *env,
>>>>> +                                                           DATA_TYPE val,
>>>>> +                                                           target_ulong addr,
>>>>> +                                                           TCGMemOpIdx oi,
>>>>> +                                                           unsigned mmu_idx,
>>>>> +                                                           uintptr_t retaddr)
>>>>> +{
>>>>> +    int i;
>>>>> +
>>>>> +    if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
>>>>> +        cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
>>>>> +                             mmu_idx, retaddr);
>>>>> +    }
>>>>> +    /* XXX: not efficient, but simple */
>>>>> +    /* Note: relies on the fact that tlb_fill() does not remove the
>>>>> +     * previous page from the TLB cache.  */
>>>>> +    for (i = DATA_SIZE - 1; i >= 0; i--) {
>>>>> +        /* Little-endian extract.  */
>>>>> +        uint8_t val8 = val >> (i * 8);
>>>>> +        /* Note the adjustment at the beginning of the function.
>>>>> +           Undo that for the recursion.  */
>>>>> +        glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
>>>>> +                                        oi, retaddr + GETPC_ADJ);
>>>>> +    }
>>>>> +}
>>>>
>>>> There is still duplication of 99% of the code here which is silly given
>>>
>>> Then why should we keep this template-like design in the first place?
>>> I tried to keep the code duplication for performance reasons
>>> (otherwise how can we justify the two almost identical versions of the
>>> helper?), while making the code more compact and readable.
>>
>> We shouldn't really - code duplication is bad for all the well known
>> reasons. The main reason we need explicit helpers for the be/le case are
>> because they are called directly from the TCG code which encodes the
>> endianess decision in the call it makes. However that doesn't stop us
>> making generic inline helpers (helpers for the helpers ;-) which the
>> compiler can sort out.
>
> I thought you wanted to make all the le/be differences conditional, not
> just those in the helpers for the helpers...

That would be nice for it all but that involves tweaking the TCG->helper
calls themselves. However if we are re-factoring common stuff from those
helpers into inlines then we can at least reduce the duplication there.

> So, if we are allowed to introduce this small overhead, all the
> helper_{le,be}_st_name_do_{unl,mmio,ram}_access can be squashed to
> helper_generic_st_do_{unl,mmio,ram}_access. I think this is what you
> proposed in the POC, right?

Well in theory it shouldn't introduce any overhead. However my proof is
currently waiting on a bug fix to GDB's disas command so I can show you
the side-by-side assembly dump ;-)

>>>> the compiler inlines the code anyway. If we gave the helper a more
>>>> generic name and passed the endianess in via args I would hope the
>>>> compiler did the sensible thing and constant fold the code. Something
>>>> like:
>>>>
>>>> static inline void glue(helper_generic_st_name, _do_unl_access)
>>>>                         (CPUArchState *env,
>>>>                         bool little_endian,
>>>>                         DATA_TYPE val,
>>>>                         target_ulong addr,
>>>>                         TCGMemOpIdx oi,
>>>>                         unsigned mmu_idx,
>>>>                         uintptr_t retaddr)
>>>> {
>>>>     int i;
>>>>
>>>>     if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
>>>>         cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
>>>>                              mmu_idx, retaddr);
>>>>     }
>>>>     /* Note: relies on the fact that tlb_fill() does not remove the
>>>>      * previous page from the TLB cache.  */
>>>>     for (i = DATA_SIZE - 1; i >= 0; i--) {
>>>>         if (little_endian) {
>>>
>>> little_endian will have the same value >99% of the time; does it make
>>> sense to have a branch here?
>>
>> The compiler should detect that little_endian is constant when it
>> inlines the code and not bother generating a test/branch case for
>> something that will never happen.
>>
>> Even if it did though I doubt a local branch would stall the processor
>> that much; have you counted how many instructions we execute once we are
>> on the slow path?
>
> Too many :)

Indeed, that is why it's SLOOOOW ;-)
>
> Regards,
> alvise
>
>>
>>>
>>> Thank you,
>>> alvise
>>>
>>>>                 /* Little-endian extract.  */
>>>>                 uint8_t val8 = val >> (i * 8);
>>>>         } else {
>>>>                 /* Big-endian extract.  */
>>>>                 uint8_t val8 = val >> (((DATA_SIZE - 1) * 8) - (i * 8));
>>>>         }
>>>>         /* Note the adjustment at the beginning of the function.
>>>>            Undo that for the recursion.  */
>>>>         glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
>>>>                                         oi, retaddr + GETPC_ADJ);
>>>>     }
>>>> }
>>>>
>>>>
>>>>> +
>>>>>  void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>>>>>                         TCGMemOpIdx oi, uintptr_t retaddr)
>>>>>  {
>>>>> @@ -433,7 +459,8 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>>>>>              return;
>>>>>          } else {
>>>>>              if ((addr & (DATA_SIZE - 1)) != 0) {
>>>>> -                goto do_unaligned_access;
>>>>> +                glue(helper_le_st_name, _do_unl_access)(env, val, addr, mmu_idx,
>>>>> +                                                        oi, retaddr);
>>>>>              }
>>>>>              iotlbentry = &env->iotlb[mmu_idx][index];
>>>>>
>>>>> @@ -449,23 +476,8 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>>>>>      if (DATA_SIZE > 1
>>>>>          && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
>>>>>                       >= TARGET_PAGE_SIZE)) {
>>>>> -        int i;
>>>>> -    do_unaligned_access:
>>>>> -        if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
>>>>> -            cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
>>>>> -                                 mmu_idx, retaddr);
>>>>> -        }
>>>>> -        /* XXX: not efficient, but simple */
>>>>> -        /* Note: relies on the fact that tlb_fill() does not remove the
>>>>> -         * previous page from the TLB cache.  */
>>>>> -        for (i = DATA_SIZE - 1; i >= 0; i--) {
>>>>> -            /* Little-endian extract.  */
>>>>> -            uint8_t val8 = val >> (i * 8);
>>>>> -            /* Note the adjustment at the beginning of the function.
>>>>> -               Undo that for the recursion.  */
>>>>> -            glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
>>>>> -                                            oi, retaddr + GETPC_ADJ);
>>>>> -        }
>>>>> +        glue(helper_le_st_name, _do_unl_access)(env, val, addr, oi, mmu_idx,
>>>>> +                                                retaddr);
>>>>>          return;
>>>>>      }
>>>>>
>>>>> @@ -485,6 +497,32 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>>>>>  }
>>>>>
>>>>>  #if DATA_SIZE > 1
>>>>> +static inline void glue(helper_be_st_name, _do_unl_access)(CPUArchState *env,
>>>>> +                                                           DATA_TYPE val,
>>>>> +                                                           target_ulong addr,
>>>>> +                                                           TCGMemOpIdx oi,
>>>>> +                                                           unsigned mmu_idx,
>>>>> +                                                           uintptr_t retaddr)
>>>>> +{
>>>>> +    int i;
>>>>> +
>>>>> +    if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
>>>>> +        cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
>>>>> +                             mmu_idx, retaddr);
>>>>> +    }
>>>>> +    /* XXX: not efficient, but simple */
>>>>> +    /* Note: relies on the fact that tlb_fill() does not remove the
>>>>> +     * previous page from the TLB cache.  */
>>>>> +    for (i = DATA_SIZE - 1; i >= 0; i--) {
>>>>> +        /* Big-endian extract.  */
>>>>> +        uint8_t val8 = val >> (((DATA_SIZE - 1) * 8) - (i * 8));
>>>>> +        /* Note the adjustment at the beginning of the function.
>>>>> +           Undo that for the recursion.  */
>>>>> +        glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
>>>>> +                                        oi, retaddr + GETPC_ADJ);
>>>>> +    }
>>>>> +}
>>>>
>>>> Not that it matters if you combine the two as suggested because anything
>>>> not called shouldn't generate the code.
>>>>
>>>>>  void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>>>>>                         TCGMemOpIdx oi, uintptr_t retaddr)
>>>>>  {
>>>>> @@ -548,7 +586,8 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>>>>>              return;
>>>>>          } else {
>>>>>              if ((addr & (DATA_SIZE - 1)) != 0) {
>>>>> -                goto do_unaligned_access;
>>>>> +                glue(helper_be_st_name, _do_unl_access)(env, val, addr, mmu_idx,
>>>>> +                                                        oi, retaddr);
>>>>>              }
>>>>>              iotlbentry = &env->iotlb[mmu_idx][index];
>>>>>
>>>>> @@ -564,23 +603,8 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>>>>>      if (DATA_SIZE > 1
>>>>>          && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
>>>>>                       >= TARGET_PAGE_SIZE)) {
>>>>> -        int i;
>>>>> -    do_unaligned_access:
>>>>> -        if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
>>>>> -            cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
>>>>> -                                 mmu_idx, retaddr);
>>>>> -        }
>>>>> -        /* XXX: not efficient, but simple */
>>>>> -        /* Note: relies on the fact that tlb_fill() does not remove the
>>>>> -         * previous page from the TLB cache.  */
>>>>> -        for (i = DATA_SIZE - 1; i >= 0; i--) {
>>>>> -            /* Big-endian extract.  */
>>>>> -            uint8_t val8 = val >> (((DATA_SIZE - 1) * 8) - (i * 8));
>>>>> -            /* Note the adjustment at the beginning of the function.
>>>>> -               Undo that for the recursion.  */
>>>>> -            glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
>>>>> -                                            oi, retaddr + GETPC_ADJ);
>>>>> -        }
>>>>> +        glue(helper_be_st_name, _do_unl_access)(env, val, addr, oi, mmu_idx,
>>>>> +                                                retaddr);
>>>>>          return;
>>>>>      }
>>>>
>>>>
>>>> --
>>>> Alex Bennée
>>
>>
>> --
>> Alex Bennée


--
Alex Bennée


* Re: [Qemu-devel] [RFC v6 10/14] softmmu: Simplify helper_*_st_name, wrap unaligned code
  2015-12-14  8:41 ` [Qemu-devel] [RFC v6 10/14] softmmu: Simplify helper_*_st_name, wrap unaligned code Alvise Rigo
  2016-01-07 14:46   ` Alex Bennée
@ 2016-01-08 11:19   ` Alex Bennée
  1 sibling, 0 replies; 60+ messages in thread
From: Alex Bennée @ 2016-01-08 11:19 UTC (permalink / raw)
  To: Alvise Rigo
  Cc: mttcg, claudio.fontana, qemu-devel, pbonzini, jani.kokkonen, tech, rth


Alvise Rigo <a.rigo@virtualopensystems.com> writes:

> To simplify the helper_*_st_name functions, wrap the
> do_unaligned_access code into an inline function.
> Also remove the goto statement.
>
> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
> ---
>  softmmu_template.h | 96 ++++++++++++++++++++++++++++++++++--------------------
>  1 file changed, 60 insertions(+), 36 deletions(-)
>
> diff --git a/softmmu_template.h b/softmmu_template.h
> index d3d5902..92f92b1 100644
> --- a/softmmu_template.h
> +++ b/softmmu_template.h
> @@ -370,6 +370,32 @@ static inline void glue(io_write, SUFFIX)(CPUArchState *env,
>                                   iotlbentry->attrs);
>  }
>
> +static inline void glue(helper_le_st_name, _do_unl_access)(CPUArchState *env,
> +                                                           DATA_TYPE val,
> +                                                           target_ulong addr,
> +                                                           TCGMemOpIdx oi,
> +                                                           unsigned mmu_idx,
> +                                                           uintptr_t retaddr)
> +{
> +    int i;
> +
> +    if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
> +        cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
> +                             mmu_idx, retaddr);
> +    }
> +    /* XXX: not efficient, but simple */
> +    /* Note: relies on the fact that tlb_fill() does not remove the
> +     * previous page from the TLB cache.  */
> +    for (i = DATA_SIZE - 1; i >= 0; i--) {
> +        /* Little-endian extract.  */
> +        uint8_t val8 = val >> (i * 8);
> +        /* Note the adjustment at the beginning of the function.
> +           Undo that for the recursion.  */
> +        glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
> +                                        oi, retaddr + GETPC_ADJ);
> +    }
> +}
> +
>  void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>                         TCGMemOpIdx oi, uintptr_t retaddr)
>  {
> @@ -433,7 +459,8 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>              return;
>          } else {
>              if ((addr & (DATA_SIZE - 1)) != 0) {
> -                goto do_unaligned_access;
> +                glue(helper_le_st_name, _do_unl_access)(env, val, addr, mmu_idx,
> +                                                        oi, retaddr);

I've just noticed this drops an implicit return. However I'm seeing if
I can put together an RFC cleanup patch set for softmmu based on these
plus a few other clean-ups. I'll CC you when done.
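To make the dropped return concrete, here is a minimal standalone model (hypothetical names, not the real softmmu helpers) of the control-flow change: the original `goto do_unaligned_access` jumped into a block that ended with `return`, so replacing the goto with a plain function call must keep an explicit `return`, otherwise execution falls through into the aligned store path as well:

```c
#include <assert.h>
#include <stdint.h>

static int unaligned_stores, aligned_stores;

static void do_unl_store(void)     { unaligned_stores++; }
static void do_aligned_store(void) { aligned_stores++; }

/* Models the refactored helper: the call below replaces a goto whose
 * target block ended in `return`, so the explicit return must stay. */
static void st_helper(uintptr_t addr)
{
    if (addr & 3) {        /* unaligned for a 4-byte access */
        do_unl_store();
        return;            /* omitting this would fall through below */
    }
    do_aligned_store();
}
```

Without that `return`, an unaligned address would be stored twice: once byte-by-byte and once by the aligned path.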

>              }
>              iotlbentry = &env->iotlb[mmu_idx][index];
>
> @@ -449,23 +476,8 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>      if (DATA_SIZE > 1
>          && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
>                       >= TARGET_PAGE_SIZE)) {
> -        int i;
> -    do_unaligned_access:
> -        if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
> -            cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
> -                                 mmu_idx, retaddr);
> -        }
> -        /* XXX: not efficient, but simple */
> -        /* Note: relies on the fact that tlb_fill() does not remove the
> -         * previous page from the TLB cache.  */
> -        for (i = DATA_SIZE - 1; i >= 0; i--) {
> -            /* Little-endian extract.  */
> -            uint8_t val8 = val >> (i * 8);
> -            /* Note the adjustment at the beginning of the function.
> -               Undo that for the recursion.  */
> -            glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
> -                                            oi, retaddr + GETPC_ADJ);
> -        }
> +        glue(helper_le_st_name, _do_unl_access)(env, val, addr, oi, mmu_idx,
> +                                                retaddr);
>          return;
>      }
>
> @@ -485,6 +497,32 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>  }
>
>  #if DATA_SIZE > 1
> +static inline void glue(helper_be_st_name, _do_unl_access)(CPUArchState *env,
> +                                                           DATA_TYPE val,
> +                                                           target_ulong addr,
> +                                                           TCGMemOpIdx oi,
> +                                                           unsigned mmu_idx,
> +                                                           uintptr_t retaddr)
> +{
> +    int i;
> +
> +    if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
> +        cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
> +                             mmu_idx, retaddr);
> +    }
> +    /* XXX: not efficient, but simple */
> +    /* Note: relies on the fact that tlb_fill() does not remove the
> +     * previous page from the TLB cache.  */
> +    for (i = DATA_SIZE - 1; i >= 0; i--) {
> +        /* Big-endian extract.  */
> +        uint8_t val8 = val >> (((DATA_SIZE - 1) * 8) - (i * 8));
> +        /* Note the adjustment at the beginning of the function.
> +           Undo that for the recursion.  */
> +        glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
> +                                        oi, retaddr + GETPC_ADJ);
> +    }
> +}
> +
>  void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>                         TCGMemOpIdx oi, uintptr_t retaddr)
>  {
> @@ -548,7 +586,8 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>              return;
>          } else {
>              if ((addr & (DATA_SIZE - 1)) != 0) {
> -                goto do_unaligned_access;
> +                glue(helper_be_st_name, _do_unl_access)(env, val, addr, mmu_idx,
> +                                                        oi, retaddr);
>              }
>              iotlbentry = &env->iotlb[mmu_idx][index];
>
> @@ -564,23 +603,8 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>      if (DATA_SIZE > 1
>          && unlikely((addr & ~TARGET_PAGE_MASK) + DATA_SIZE - 1
>                       >= TARGET_PAGE_SIZE)) {
> -        int i;
> -    do_unaligned_access:
> -        if ((get_memop(oi) & MO_AMASK) == MO_ALIGN) {
> -            cpu_unaligned_access(ENV_GET_CPU(env), addr, MMU_DATA_STORE,
> -                                 mmu_idx, retaddr);
> -        }
> -        /* XXX: not efficient, but simple */
> -        /* Note: relies on the fact that tlb_fill() does not remove the
> -         * previous page from the TLB cache.  */
> -        for (i = DATA_SIZE - 1; i >= 0; i--) {
> -            /* Big-endian extract.  */
> -            uint8_t val8 = val >> (((DATA_SIZE - 1) * 8) - (i * 8));
> -            /* Note the adjustment at the beginning of the function.
> -               Undo that for the recursion.  */
> -            glue(helper_ret_stb, MMUSUFFIX)(env, addr + i, val8,
> -                                            oi, retaddr + GETPC_ADJ);
> -        }
> +        glue(helper_be_st_name, _do_unl_access)(env, val, addr, oi, mmu_idx,
> +                                                retaddr);
>          return;
>      }


--
Alex Bennée


* Re: [Qemu-devel] [RFC v6 11/14] softmmu: Simplify helper_*_st_name, wrap MMIO code
  2015-12-14  8:41 ` [Qemu-devel] [RFC v6 11/14] softmmu: Simplify helper_*_st_name, wrap MMIO code Alvise Rigo
@ 2016-01-11  9:54   ` Alex Bennée
  2016-01-11 10:19     ` alvise rigo
  0 siblings, 1 reply; 60+ messages in thread
From: Alex Bennée @ 2016-01-11  9:54 UTC (permalink / raw)
  To: Alvise Rigo
  Cc: mttcg, claudio.fontana, qemu-devel, pbonzini, jani.kokkonen, tech, rth


Alvise Rigo <a.rigo@virtualopensystems.com> writes:

> Attempting to simplify the helper_*_st_name, wrap the MMIO code into an
> inline function.
>
> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
> ---
>  softmmu_template.h | 64 +++++++++++++++++++++++++++++++++++++-----------------
>  1 file changed, 44 insertions(+), 20 deletions(-)
>
> diff --git a/softmmu_template.h b/softmmu_template.h
> index 92f92b1..2ebf527 100644
> --- a/softmmu_template.h
> +++ b/softmmu_template.h
> @@ -396,6 +396,26 @@ static inline void glue(helper_le_st_name, _do_unl_access)(CPUArchState *env,
>      }
>  }
>
> +static inline void glue(helper_le_st_name, _do_mmio_access)(CPUArchState *env,
> +                                                            DATA_TYPE val,
> +                                                            target_ulong addr,
> +                                                            TCGMemOpIdx oi,
> +                                                            unsigned mmu_idx,
> +                                                            int index,
> +                                                            uintptr_t retaddr)
> +{
> +    CPUIOTLBEntry *iotlbentry = &env->iotlb[mmu_idx][index];
> +
> +    if ((addr & (DATA_SIZE - 1)) != 0) {
> +        glue(helper_le_st_name, _do_unl_access)(env, val, addr, mmu_idx,
> +                                                oi, retaddr);
> +    }
> +    /* ??? Note that the io helpers always read data in the target
> +       byte ordering.  We should push the LE/BE request down into io.  */
> +    val = TGT_LE(val);
> +    glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
> +}
> +

Same comment as on the previous patches: I think we can have a single
function that is shared between both helpers.

>  void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>                         TCGMemOpIdx oi, uintptr_t retaddr)
>  {
> @@ -458,16 +478,8 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>
>              return;
>          } else {
> -            if ((addr & (DATA_SIZE - 1)) != 0) {
> -                glue(helper_le_st_name, _do_unl_access)(env, val, addr, mmu_idx,
> -                                                        oi, retaddr);
> -            }
> -            iotlbentry = &env->iotlb[mmu_idx][index];
> -
> -            /* ??? Note that the io helpers always read data in the target
> -               byte ordering.  We should push the LE/BE request down into io.  */
> -            val = TGT_LE(val);
> -            glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
> +            glue(helper_le_st_name, _do_mmio_access)(env, val, addr, oi,
> +                                                     mmu_idx, index, retaddr);
>              return;
>          }
>      }
> @@ -523,6 +535,26 @@ static inline void glue(helper_be_st_name, _do_unl_access)(CPUArchState *env,
>      }
>  }
>
> +static inline void glue(helper_be_st_name, _do_mmio_access)(CPUArchState *env,
> +                                                            DATA_TYPE val,
> +                                                            target_ulong addr,
> +                                                            TCGMemOpIdx oi,
> +                                                            unsigned mmu_idx,
> +                                                            int index,
> +                                                            uintptr_t retaddr)
> +{
> +    CPUIOTLBEntry *iotlbentry = &env->iotlb[mmu_idx][index];
> +
> +    if ((addr & (DATA_SIZE - 1)) != 0) {
> +        glue(helper_be_st_name, _do_unl_access)(env, val, addr, mmu_idx,
> +                                                oi, retaddr);
> +    }
> +    /* ??? Note that the io helpers always read data in the target
> +       byte ordering.  We should push the LE/BE request down into io.  */
> +    val = TGT_BE(val);
> +    glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
> +}
> +
>  void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>                         TCGMemOpIdx oi, uintptr_t retaddr)
>  {
> @@ -585,16 +617,8 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>
>              return;
>          } else {
> -            if ((addr & (DATA_SIZE - 1)) != 0) {
> -                glue(helper_be_st_name, _do_unl_access)(env, val, addr, mmu_idx,
> -                                                        oi, retaddr);
> -            }
> -            iotlbentry = &env->iotlb[mmu_idx][index];
> -
> -            /* ??? Note that the io helpers always read data in the target
> -               byte ordering.  We should push the LE/BE request down into io.  */
> -            val = TGT_BE(val);
> -            glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
> +            glue(helper_be_st_name, _do_mmio_access)(env, val, addr, oi,
> +                                                     mmu_idx, index, retaddr);
>              return;
>          }
>      }


--
Alex Bennée


* Re: [Qemu-devel] [RFC v6 11/14] softmmu: Simplify helper_*_st_name, wrap MMIO code
  2016-01-11  9:54   ` Alex Bennée
@ 2016-01-11 10:19     ` alvise rigo
  0 siblings, 0 replies; 60+ messages in thread
From: alvise rigo @ 2016-01-11 10:19 UTC (permalink / raw)
  To: Alex Bennée
  Cc: mttcg, Claudio Fontana, QEMU Developers, Paolo Bonzini,
	Jani Kokkonen, VirtualOpenSystems Technical Team,
	Richard Henderson

On Mon, Jan 11, 2016 at 10:54 AM, Alex Bennée <alex.bennee@linaro.org> wrote:
>
> Alvise Rigo <a.rigo@virtualopensystems.com> writes:
>
>> Attempting to simplify the helper_*_st_name, wrap the MMIO code into an
>> inline function.
>>
>> Suggested-by: Jani Kokkonen <jani.kokkonen@huawei.com>
>> Suggested-by: Claudio Fontana <claudio.fontana@huawei.com>
>> Signed-off-by: Alvise Rigo <a.rigo@virtualopensystems.com>
>> ---
>>  softmmu_template.h | 64 +++++++++++++++++++++++++++++++++++++-----------------
>>  1 file changed, 44 insertions(+), 20 deletions(-)
>>
>> diff --git a/softmmu_template.h b/softmmu_template.h
>> index 92f92b1..2ebf527 100644
>> --- a/softmmu_template.h
>> +++ b/softmmu_template.h
>> @@ -396,6 +396,26 @@ static inline void glue(helper_le_st_name, _do_unl_access)(CPUArchState *env,
>>      }
>>  }
>>
>> +static inline void glue(helper_le_st_name, _do_mmio_access)(CPUArchState *env,
>> +                                                            DATA_TYPE val,
>> +                                                            target_ulong addr,
>> +                                                            TCGMemOpIdx oi,
>> +                                                            unsigned mmu_idx,
>> +                                                            int index,
>> +                                                            uintptr_t retaddr)
>> +{
>> +    CPUIOTLBEntry *iotlbentry = &env->iotlb[mmu_idx][index];
>> +
>> +    if ((addr & (DATA_SIZE - 1)) != 0) {
>> +        glue(helper_le_st_name, _do_unl_access)(env, val, addr, mmu_idx,
>> +                                                oi, retaddr);
>> +    }
>> +    /* ??? Note that the io helpers always read data in the target
>> +       byte ordering.  We should push the LE/BE request down into io.  */
>> +    val = TGT_LE(val);
>> +    glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
>> +}
>> +
>
> Some comment as previous patches. I think we can have a single function
> that is shared between both helpers.

Of course. If the objdump output you get from this version and from the
single-helper version is basically the same, then there's no reason to
keep two distinct variants.

Thank you,
alvise

>
>>  void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>>                         TCGMemOpIdx oi, uintptr_t retaddr)
>>  {
>> @@ -458,16 +478,8 @@ void helper_le_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>>
>>              return;
>>          } else {
>> -            if ((addr & (DATA_SIZE - 1)) != 0) {
>> -                glue(helper_le_st_name, _do_unl_access)(env, val, addr, mmu_idx,
>> -                                                        oi, retaddr);
>> -            }
>> -            iotlbentry = &env->iotlb[mmu_idx][index];
>> -
>> -            /* ??? Note that the io helpers always read data in the target
>> -               byte ordering.  We should push the LE/BE request down into io.  */
>> -            val = TGT_LE(val);
>> -            glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
>> +            glue(helper_le_st_name, _do_mmio_access)(env, val, addr, oi,
>> +                                                     mmu_idx, index, retaddr);
>>              return;
>>          }
>>      }
>> @@ -523,6 +535,26 @@ static inline void glue(helper_be_st_name, _do_unl_access)(CPUArchState *env,
>>      }
>>  }
>>
>> +static inline void glue(helper_be_st_name, _do_mmio_access)(CPUArchState *env,
>> +                                                            DATA_TYPE val,
>> +                                                            target_ulong addr,
>> +                                                            TCGMemOpIdx oi,
>> +                                                            unsigned mmu_idx,
>> +                                                            int index,
>> +                                                            uintptr_t retaddr)
>> +{
>> +    CPUIOTLBEntry *iotlbentry = &env->iotlb[mmu_idx][index];
>> +
>> +    if ((addr & (DATA_SIZE - 1)) != 0) {
>> +        glue(helper_be_st_name, _do_unl_access)(env, val, addr, mmu_idx,
>> +                                                oi, retaddr);
>> +    }
>> +    /* ??? Note that the io helpers always read data in the target
>> +       byte ordering.  We should push the LE/BE request down into io.  */
>> +    val = TGT_BE(val);
>> +    glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
>> +}
>> +
>>  void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>>                         TCGMemOpIdx oi, uintptr_t retaddr)
>>  {
>> @@ -585,16 +617,8 @@ void helper_be_st_name(CPUArchState *env, target_ulong addr, DATA_TYPE val,
>>
>>              return;
>>          } else {
>> -            if ((addr & (DATA_SIZE - 1)) != 0) {
>> -                glue(helper_be_st_name, _do_unl_access)(env, val, addr, mmu_idx,
>> -                                                        oi, retaddr);
>> -            }
>> -            iotlbentry = &env->iotlb[mmu_idx][index];
>> -
>> -            /* ??? Note that the io helpers always read data in the target
>> -               byte ordering.  We should push the LE/BE request down into io.  */
>> -            val = TGT_BE(val);
>> -            glue(io_write, SUFFIX)(env, iotlbentry, val, addr, retaddr);
>> +            glue(helper_be_st_name, _do_mmio_access)(env, val, addr, oi,
>> +                                                     mmu_idx, index, retaddr);
>>              return;
>>          }
>>      }
>
>
> --
> Alex Bennée



Thread overview: 60+ messages
2015-12-14  8:41 [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation Alvise Rigo
2015-12-14  8:41 ` [Qemu-devel] [RFC v6 01/14] exec.c: Add new exclusive bitmap to ram_list Alvise Rigo
2015-12-18 13:18   ` Alex Bennée
2015-12-18 13:47     ` alvise rigo
2015-12-14  8:41 ` [Qemu-devel] [RFC v6 02/14] softmmu: Add new TLB_EXCL flag Alvise Rigo
2016-01-05 16:10   ` Alex Bennée
2016-01-05 17:27     ` alvise rigo
2016-01-05 18:39       ` Alex Bennée
2015-12-14  8:41 ` [Qemu-devel] [RFC v6 03/14] Add CPUClass hook to set exclusive range Alvise Rigo
2016-01-05 16:42   ` Alex Bennée
2015-12-14  8:41 ` [Qemu-devel] [RFC v6 04/14] softmmu: Add helpers for a new slowpath Alvise Rigo
2016-01-06 15:16   ` Alex Bennée
2015-12-14  8:41 ` [Qemu-devel] [RFC v6 05/14] tcg: Create new runtime helpers for excl accesses Alvise Rigo
2015-12-14  9:40   ` Paolo Bonzini
2015-12-14  8:41 ` [Qemu-devel] [RFC v6 06/14] configure: Use slow-path for atomic only when the softmmu is enabled Alvise Rigo
2015-12-14  9:38   ` Paolo Bonzini
2015-12-14  9:39     ` Paolo Bonzini
2015-12-14 10:14   ` Laurent Vivier
2015-12-15 14:23     ` alvise rigo
2015-12-15 14:31       ` Paolo Bonzini
2015-12-15 15:18         ` Laurent Vivier
2015-12-14  8:41 ` [Qemu-devel] [RFC v6 07/14] target-arm: translate: Use ld/st excl for atomic insns Alvise Rigo
2016-01-06 17:11   ` Alex Bennée
2015-12-14  8:41 ` [Qemu-devel] [RFC v6 08/14] target-arm: Add atomic_clear helper for CLREX insn Alvise Rigo
2016-01-06 17:13   ` Alex Bennée
2016-01-06 17:27     ` alvise rigo
2015-12-14  8:41 ` [Qemu-devel] [RFC v6 09/14] softmmu: Add history of excl accesses Alvise Rigo
2015-12-14  9:35   ` Paolo Bonzini
2015-12-15 14:26     ` alvise rigo
2015-12-14  8:41 ` [Qemu-devel] [RFC v6 10/14] softmmu: Simplify helper_*_st_name, wrap unaligned code Alvise Rigo
2016-01-07 14:46   ` Alex Bennée
2016-01-07 15:09     ` alvise rigo
2016-01-07 16:35       ` Alex Bennée
2016-01-07 16:54         ` alvise rigo
2016-01-07 17:36           ` Alex Bennée
2016-01-08 11:19   ` Alex Bennée
2015-12-14  8:41 ` [Qemu-devel] [RFC v6 11/14] softmmu: Simplify helper_*_st_name, wrap MMIO code Alvise Rigo
2016-01-11  9:54   ` Alex Bennée
2016-01-11 10:19     ` alvise rigo
2015-12-14  8:41 ` [Qemu-devel] [RFC v6 12/14] softmmu: Simplify helper_*_st_name, wrap RAM code Alvise Rigo
2015-12-17 16:52   ` Alex Bennée
2015-12-17 17:13     ` alvise rigo
2015-12-17 20:20       ` Alex Bennée
2015-12-14  8:41 ` [Qemu-devel] [RFC v6 13/14] softmmu: Include MMIO/invalid exclusive accesses Alvise Rigo
2015-12-14  8:41 ` [Qemu-devel] [RFC v6 14/14] softmmu: Protect MMIO exclusive range Alvise Rigo
2015-12-14  9:33 ` [Qemu-devel] [RFC v6 00/14] Slow-path for atomic instruction translation Paolo Bonzini
2015-12-14 10:04   ` alvise rigo
2015-12-14 10:17     ` Paolo Bonzini
2015-12-15 13:59       ` alvise rigo
2015-12-15 14:18         ` Paolo Bonzini
2015-12-15 14:22           ` alvise rigo
2015-12-14 22:09 ` Andreas Tobler
2015-12-15  8:16   ` alvise rigo
2015-12-17 16:06 ` Alex Bennée
2015-12-17 16:16   ` alvise rigo
2016-01-06 18:00 ` Andrew Baumann
2016-01-07 10:21   ` alvise rigo
2016-01-07 10:22     ` Peter Maydell
2016-01-07 10:49       ` alvise rigo
2016-01-07 11:16         ` Peter Maydell
